
Integration Concept and Value of Zetuo Kunlun Klustron Distributed Database


The Zetuo Kunlun Klustron distributed database is deeply integrated with, and fully compatible with, the MySQL and PostgreSQL open-source ecosystems. This seamless compatibility is one of Klustron's most distinctive advantages: applications and backend systems originally built on MySQL or PostgreSQL can run on Klustron without modification, and users can draw on the vast technological and talent resources of both ecosystems, including functional extension components and accumulated knowledge and experience. Based on this capability, we propose the concept of the "Zetuo Kunlun Klustron Fusion Data Management and Analysis Platform (FDMAP)," built on four key principles:

  1. Integration of multi-model data: relational model, extended by JSON, GIS, vector, and other complex data types.
  2. Integration of data from multiple sources: cluster local, external databases, external object storage.
  3. Integration of open-source ecosystem components: PostGIS, PostgresML, Apache MadLib, PGVector, PGEmbedding, etc.
  4. Integration of user-defined algorithms: multi-language (Python, Java, Perl, Lua, JavaScript, PL/SQL) stored procedures.

Building on these principles, we propose the concept of a scalable AI data infrastructure, which we believe is a fundamental requirement for data management by large models across various industries.

01 Leveraging PostgreSQL Open-Source Ecosystem Extension Components

The compute nodes of Zetuo Kunlun Klustron, known as Klustron-server, retain PostgreSQL's complete extension mechanism. As a result, the vast majority of extension components from the PostgreSQL open-source ecosystem work on Klustron-server without modification. The few exceptions include PostGIS and PGVector. PostGIS introduces a series of new GIS data types, such as geometry and geography, and requires minor modifications so that it can exchange GIS data with Klustron's storage nodes, Klustron-storage.

PGVector implements vector indexing, but within Klustron, vector indexes must be stored in Klustron-storage nodes, which do not natively support vector indexing. As a result, it is necessary to develop a vector indexing feature compatible with PGVector. The upcoming Klustron 1.3 version will support both PostGIS and PGVector, and the current daily build versions of Klustron already offer PostGIS support, functioning identically to its traditional usage.
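To make the PGVector compatibility concrete, the following is a minimal sketch of the extension's standard usage pattern; the table name, vector dimension, and sample values are illustrative, not taken from Klustron documentation. The `<->` operator is pgvector's Euclidean-distance operator.

```sql
-- Enable the extension (named "vector" in the pgvector distribution).
CREATE EXTENSION IF NOT EXISTS vector;

-- A table with an embedding column; dimension 3 chosen for illustration.
CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding vector(3)
);

INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');

-- Nearest-neighbor search ordered by Euclidean distance.
SELECT id FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
```

On Klustron, the stated goal is for queries like this to work unchanged while the index itself lives in Klustron-storage nodes.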

By supporting the entire suite of PostgreSQL open-source ecosystem extensions, Klustron not only expands its own capabilities automatically but also frees these components from their original limitation to single-server PostgreSQL instances, which are constrained by the compute and storage resources of one machine. This matters especially for computation-intensive tasks such as machine learning, OLAP analysis, and stored procedures: on a standalone instance, these tasks can consume much of the limited CPU, degrading the database's transactional performance, and even with read-write separation the achievable concurrency for such tasks remains very limited.

On a Klustron distributed database cluster, by contrast, these components can manage and operate on data at virtually unlimited scale, elastically scaling computational resources, storage, and network bandwidth as needed. Klustron thus enables these components to overcome the scalability limits imposed by single-server hardware, leveraging the resources of many servers for data management and query analysis.

Moreover, Klustron's multi-level parallel query processing and JIT among other query performance enhancement technologies significantly boost the performance of OLAP analysis tasks, stored procedures, and machine learning computations beyond the capabilities of standalone PostgreSQL instances.

This article will not enumerate all automatically supported PostgreSQL components due to their vast number, but it will highlight a few unique components specifically.

1. Machine Learning Components

PostgresML is a well-known machine learning component within the PostgreSQL open-source ecosystem, while Apache MadLib is another machine learning component supported by the Apache community for PostgreSQL. The Klustron distributed database supports these machine learning components, enabling users to run machine learning algorithms directly within the database. The value of this approach is detailed further below.
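As a hedged sketch of what "running machine learning directly within the database" looks like, the following uses PostgresML's `pgml.train` and `pgml.predict` functions; the project name, source table, label column, and feature values are all hypothetical.

```sql
-- Train a classifier over an existing table, entirely in-database.
-- 'customer_features' and 'churned' are placeholder names.
SELECT pgml.train(
    project_name  => 'churn_model',
    task          => 'classification',
    relation_name => 'customer_features',
    y_column_name => 'churned'
);

-- Score a new feature vector with the trained model.
SELECT pgml.predict('churn_model', ARRAY[0.2, 1.0, 3.5]) AS churn_score;
```

Because training and inference run inside compute nodes, the feature data never leaves the cluster.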

2. FDW Components

Klustron's compute nodes leverage PostgreSQL's FDW capabilities to access data from a variety of third-party sources, including other database systems, public cloud object storage systems like S3 and OSS, and HBase/Hive. This enables unified querying and OLAP analysis across both local and external data sets. The considerations for using FDW within Klustron are elaborated on in the sections that follow.
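The FDW workflow follows standard PostgreSQL conventions. Below is a sketch using `postgres_fdw` to expose tables from an external PostgreSQL-compatible database; the host, database name, schema, and credentials are placeholders.

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Describe the external server.
CREATE SERVER remote_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'sales');

-- Map the local user to credentials on the external server.
CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_pg
    OPTIONS (user 'reader', password 'secret');

-- Pull in foreign table definitions so they can be queried locally.
CREATE SCHEMA remote_sales;
IMPORT FOREIGN SCHEMA public
    FROM SERVER remote_pg INTO remote_sales;
```

Other FDWs (for S3/OSS object storage, HBase/Hive, and so on) follow the same server/user-mapping/foreign-table pattern with source-specific options.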

3. Stored Procedure Language Component

Klustron supports all of PostgreSQL's stored procedure language components, including PL/SQL, plpython, plperl, pltcl, pllua, plv8, and pljava. This empowers users to write stored procedures in PL/SQL, Python, Perl, TCL, Lua, JavaScript, and Java, and run them on Klustron's compute nodes, executing user-developed data processing logic inside those nodes. Writing stored procedures in languages such as Python and Java also gives access to those languages' rich algorithm libraries, allowing powerful data processing algorithms to be developed rapidly.
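As an illustration of the multi-language stored procedure support, here is a small PL/Python function; it assumes the `plpython3u` language is installed on the compute node, and the function name and logic are hypothetical.

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

-- A scalar function whose body is plain Python, executed inside the
-- database server process on a compute node.
CREATE FUNCTION normalize_score(raw double precision)
RETURNS double precision AS $$
    import math
    # Logistic squashing of a raw score into (0, 1).
    return 1.0 / (1.0 + math.exp(-raw))
$$ LANGUAGE plpython3u;

SELECT normalize_score(0.0);  -- returns 0.5
```

The same pattern applies to plperl, pllua, plv8, and pljava, with the function body written in the respective language.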

Users can confidently utilize stored procedures extensively within the Klustron cluster for in-database computations and data processing without worrying about excessive consumption of computational resources. This is because it is possible to deploy several compute nodes dedicated to executing stored procedures and to scale up servers as needed. This approach minimizes the need for data movement across the network, reducing bandwidth consumption and latency, and thus, enhancing overall performance.

02 New Paradigms in Data Management

2.1 In-Database Computation and Data Processing

In the era of standalone databases, all data was stored and all transaction processing occurred on the same computer server. The common approach for executing large-scale complex analytical computations was to implement the computational logic within application software modules. The system would then transfer data out of the database to application servers for computation, thus avoiding competition for computational resources with the database system's core tasks (such as transaction processing and data writing). However, this approach came with the cost of moving large volumes of data, creating significant pressure on network bandwidth and substantially increasing the time required for data analysis and processing tasks. Additionally, application servers often lacked the infrastructure for handling massive data volumes found in database systems (such as buffer pools, temporary tables, and external indexes), resulting in poor data processing performance for large datasets and necessitating alternative technological solutions to address performance issues.

Performing analysis and data processing within the database offers the advantage of avoiding the network bandwidth consumption and time delays associated with large-scale data migration. For standalone databases, however, this approach poses problems for handling large volumes of in-database intensive computations, as it can impact OLTP workloads and offers poor scalability—even if the database instance is solely dedicated to analytical workloads, a slight increase in load can exhaust computational resources. Distributed clusters that utilize extensive hardware resources for analysis and computation circumvent these issues, highlighting the benefits of in-database computation once again.

Database systems possess a robust infrastructure for managing and processing massive volumes of data, including buffer pools, temporary tables, and external indexes. Therefore, it is feasible to execute stored procedures, perform OLAP analysis, or undertake machine learning tasks directly within the database, independent of components like message queues (e.g., Kafka, RabbitMQ) or data stream analytics (e.g., Flink). These components represent contemporary mainstream methods for incremental analysis and processing of large data volumes under current technological conditions. Utilizing such a toolchain requires a dedicated data analytics technical team to learn numerous "new technologies" and develop many new data processing modules, in addition to an operations team to maintain these components and ensure their efficient ongoing operation. Moreover, running these components necessitates a significant number of computer servers, substantially increasing the cost of data analysis and processing for users.

Klustron's Solution

Using the Klustron distributed database cluster for in-database analysis and data processing addresses the aforementioned challenges. We believe Klustron can support users in employing a superior technology stack for data processing and analysis. Users need only deploy a set of computational nodes dedicated to running computationally intensive data analysis and processing tasks, including performing SQL-based OLAP data analysis, running machine learning tasks with PostgresML or Apache MadLib, or executing data computation and processing logic in the form of stored procedures. Moreover, these data analysis and processing tasks read data from replicas, thereby entirely avoiding any impact on data writing and transaction processing performance. Additionally, there is no need to install, deploy, or maintain Flink or Kafka clusters, eliminating related personnel costs and the reliability risks associated with these components (the more system components, the greater the need for coordination and the higher the risk of failure). Furthermore, these analysis tasks benefit from Klustron's distributed parallel query processing technology, achieving excellent data analysis and processing performance.

2.2 Enhancing the Relational Data Model with Complex Data Types

Across the IT systems of various industries, the relational data model remains the most widely applied foundational data model. Complex data types such as GIS, JSON, and vector data offer powerful functionality and rich application scenarios within their respective specialized domains. However, deploying and maintaining a dedicated database system for each of these complex data types is highly complex and costly: the expenses include not only server hardware but also significant labor costs for maintenance personnel. An even greater source of complexity and cost is that application systems must develop additional modules to combine relational, GIS, JSON, and vector data drawn from multiple database systems, and in particular to reliably implement transaction semantics (ACID) for reads and writes that span those systems. Achieving this at the application layer is exceedingly complex and technically challenging, and the resulting reliability and other technical metrics must withstand rigorous testing and verification. All of this substantially increases application development workload. Moreover, such applications may also require message queues, etcd/ZooKeeper, and other state synchronization tools, adding further hardware expenses and maintenance burden.

Klustron's Solution

By supporting PostGIS and PGVector, Klustron adds GIS and vector data types on top of its existing JSON (semi-structured) support, producing a combined capability far greater than the sum of its parts. These three complex data types significantly enhance and extend the relational data model, enabling Klustron users to build powerful, rich, flexible, and efficient application software systems. This greatly reduces the development, hardware, and maintenance costs of deploying multiple database systems, along with the coordination complexity and reliability risks of operating them together.
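The value of unifying these data types shows up in single queries that mix them. The following sketch combines relational columns, a JSONB attribute filter, and a PostGIS proximity predicate; the table layout and coordinates are hypothetical, and PostGIS is assumed to be installed.

```sql
CREATE TABLE stores (
    id       bigserial PRIMARY KEY,
    name     text,
    attrs    jsonb,                    -- semi-structured attributes
    location geometry(Point, 4326)     -- PostGIS point, WGS 84
);

-- Stores flagged as open 24 hours within roughly 1 km of a given point.
SELECT name
FROM stores
WHERE attrs @> '{"open_24h": true}'
  AND ST_DWithin(location::geography,
                 ST_SetSRID(ST_MakePoint(121.47, 31.23), 4326)::geography,
                 1000);
```

In a multi-database architecture, the same query would require fetching candidates from two or three systems and joining them in application code.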

In summary, Klustron's support for in-database computation and various data models facilitates a new paradigm in data processing. This paradigm is fundamentally different from the widely prevalent external computation models, offering a series of significant advantages and solving various technical challenges that necessitated external computation in the era of centralized databases. This approach is feasible, offering substantial value to users.

03 Utilizing External Data Sources

The practice of reading data from external data sources and integrating it with internal database queries has been present in centralized databases for many years, including in systems such as Oracle, MySQL, and PostgreSQL. Klustron leverages the PostgreSQL ecosystem's FDW (Foreign Data Wrapper) components to read data from various external sources, embodying part of Klustron's integration capability, hence its inclusion in this article.

Often, users wish to establish data tables within Klustron and then import data from third-party sources. This approach allows for not only read-only internal data analysis and processing but also reliable data updates within transactions. However, this method results in duplicating the same data (at least two copies across primary and replica nodes), consuming significant storage resources. Users might decide against importing data into the Klustron cluster due to data security concerns or storage costs, opting instead to directly use Klustron's FDW components. This enables Klustron's computational nodes to read data from third-party sources (including other database systems, HBase/Hive, public cloud object storage, etc.) and use internal data tables for multi-source data execution and join queries.
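Once a foreign table is defined through an FDW, it can be joined against local Klustron tables in ordinary SQL. The sketch below assumes `orders_remote` was previously created with `CREATE FOREIGN TABLE` or `IMPORT FOREIGN SCHEMA`, and `customers` is a local table; all names are illustrative.

```sql
-- Join a local table with an external source read through an FDW.
SELECT c.customer_id, c.name, sum(o.amount) AS total
FROM customers AS c            -- local Klustron table
JOIN orders_remote AS o        -- foreign table backed by an external source
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;
```

The compute node pulls qualifying rows from the external source at query time, so no copy of the external data needs to be stored in the cluster.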

While this approach may seem ideal, one should not expect globally consistent reads when accessing data from external sources, as such sources generally cannot provide that guarantee; some do not even support transaction semantics (ACID).

Furthermore, we advise against updating data from multiple external sources within the same transaction on a Klustron cluster. Some data sources do not support transactions, some FDW components do not support data updates, and some sources do not support two-phase commit transactions. As such, Klustron cannot guarantee transactional reliability (ACID) for such updates.

04 Scalable AI Data Infrastructure

In the future, the widespread application of large models across various industries will result in application software forms that are closely linked to these industry-specific large models. In essence, application software systems will often be dynamically generated and adjusted on demand by these large models, offering a level of simplicity, flexibility, and ease of use that surpasses that of previous low-code technologies. Such flexibility in application software necessitates a set of capabilities in data management, which we define as a scalable AI data infrastructure, encompassing at least the following aspects:

  1. Support for vector data management and vector distance (similarity) queries.
  2. Automated fault recovery processes that require no manual intervention.
  3. Compatibility with SQL standards, as large models will generate SQL queries of varying complexity that adhere to these standards, which the database system must be capable of executing.
  4. Elastic scalability of computational and storage resources on demand, without the need for manual intervention.
  5. Unified management and operation of relational data models as well as complex data types such as GIS, JSON, and vectors.

Application software generated by large models requires a unified data management interface to manage relational data, GIS, JSON, and vector data within a single database system. If one were to use multiple, separate database systems to store these four types of data, reliably implementing cross-database system transactions would far exceed the capabilities of large models.

Zetuo Kunlun Klustron possesses all the aforementioned capabilities. It supports running machine learning plugins such as PostgresML and Apache MadLib within compute nodes, as well as user-defined data processing modules written in languages like Python and Java, and it supports the PGVector plugin for managing vector data and performing vector distance (similarity) queries.

With this comprehensive set of capabilities, Zetuo Kunlun Klustron embodies a scalable AI data infrastructure.
