Using Advanced SQL Features in KunlunBase

KunlunBase supports a wealth of advanced SQL features, including views, materialized views, triggers, stored procedures, domains, CHECK constraints, Row Level Security (RLS), and multi-level, multi-granularity access control, among others. These features are inherited from PostgreSQL, and we have extended and enhanced them so that they continue to work properly in the KunlunBase distributed database system.

These advanced SQL features date back to the classic database era. In today's era of distributed databases, are they still useful? What impact do they have on the performance of the database system? Are there better alternatives? These are questions that anyone with substantial database experience will ask. Below, we introduce the usage scenarios for each feature, along with its value and trade-offs in KunlunBase.

Access Control

In this era of information and intelligence, data is widely likened to gold, which makes databases the vaults. It is crucial to define precise Access Control Rules (ACR) in the database system, the source of the data, so that only users with the appropriate permissions can read specific data or insert, delete, and update it. This is the only way to ensure data security. Implementing ACR solely outside the database system is like leaving the vault door open and naively hoping that the gold will not be taken. Managing data access control rules is therefore a key task of a professional DBA.

KunlunBase has inherited the complete access control system of PostgreSQL, including the management of Roles and Users, connection authorization (pg_hba.conf), and ACR management and enforcement. Their functionality and usage are identical to PostgreSQL's, so this article focuses on a few features that users often overlook.

KunlunBase supports multi-level, multi-granularity access control. ACR can be set at granularities such as database, schema, table, row, column, domain (which can be understood as a column type), and sequence, and can govern every meaningful operation type for each kind of database object.
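
As a brief sketch of what these granularities look like in practice, the PostgreSQL-style GRANT syntax that KunlunBase inherits might be used as follows (the role, database, schema, table, and sequence names here are hypothetical):

```sql
-- Hypothetical role app_reader and objects, for illustration only.
GRANT CONNECT ON DATABASE appdb TO app_reader;                  -- database level
GRANT USAGE   ON SCHEMA sales TO app_reader;                    -- schema level
GRANT SELECT  ON sales.orders TO app_reader;                    -- table level
GRANT SELECT  (order_id, status) ON sales.orders TO app_reader; -- column level
GRANT USAGE   ON SEQUENCE sales.orders_id_seq TO app_reader;    -- sequence level
```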

For each user created on a computing node, KunlunBase also creates a user with the same name on the storage nodes, and it connects to the storage nodes using the username of the currently logged-in user. Consequently, the owner of the table shards a user creates on the storage nodes is the username logged in at the time of the operation. Although the storage nodes also have access control capabilities, KunlunBase performs its access control checks on the computing node: an SQL statement enters the query optimization and execution stages only after it passes the access control check. Connecting to a KunlunBase storage node still requires username and password verification. Except in special cases, DBAs do not need to connect to storage nodes directly, and application software definitely should not (nor can it, since usernames and passwords created on the computing nodes do not work there). Doing so would not only bypass the computing node's access control mechanism; if such a user ALTERed a data table, the computing nodes would be unaware of it, causing serious metadata mismatch issues.

KunlunBase also supports defining ACR on views, materialized views, and stored procedures. Through views, you can define very precise and flexible ACR over a combination of specific rows and columns from multiple tables; this is covered in the section on views. By defining ACR on stored procedures, the operations that ACR controls are extended from basic insert, delete, select, and update operations to a user-defined whole composed of a group of basic operations, and the data that ACR protects is defined by the procedure's logic, achieving the highest level of flexibility and customizability.

Row Level Security

The row-level ACR of KunlunBase is implemented with the Row Level Security (RLS) feature. RLS, combined with the column-level ACR of the GRANT statement, lets users define access control rules at the finest granularity: the individual field. Users execute the CREATE POLICY statement to define which operations (SELECT, INSERT, UPDATE, DELETE) specific users may perform on which rows of specific tables. The 'which rows' part is an expression that references columns of the table and evaluates to a boolean for each row; only when it is true can the user perform the operation specified by the rule on that row. Multiple policies can be defined for a table to separately control each user's permission for each operation on the table.

RLS has a minor impact on the performance of inserts, updates, deletes, and queries, but in exchange it provides very fine-grained access control and a strong data security guarantee. In practice, policies can be created for the few tables containing key data. For example, the salary column in a personnel table should be invisible even to the DBA account: only senior managers such as the Human Resources Director can see the salaries of all employees, a department manager can see the salaries of employees in their department, and an individual employee can see only their own salary. Supplier contract prices, customer-specific product prices, and other negotiated details are highly confidential in many companies; the DBA has no right to see them, and only the accounts of the designated senior managers can. This fully protects the confidentiality of company information.
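
A minimal sketch of the salary scenario above, using standard PostgreSQL RLS syntax (the table, column, and role names are hypothetical):

```sql
-- Hypothetical table: one row per employee.
CREATE TABLE employee (
    id         bigint PRIMARY KEY,
    login_name text NOT NULL,    -- matches the employee's database user name
    dept_id    int  NOT NULL,
    salary     numeric(12,2)
);

ALTER TABLE employee ENABLE ROW LEVEL SECURITY;

-- The HR director role can read every row.
CREATE POLICY hr_all ON employee FOR SELECT TO hr_director
    USING (true);

-- An ordinary employee role can read only its own row.
CREATE POLICY self_only ON employee FOR SELECT TO employee_role
    USING (login_name = current_user);
```

A department-manager policy would be written the same way, with a USING expression that compares dept_id against the manager's department.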

Accordingly, users with such high-level permissions need to connect to the database through their own dedicated database accounts in the application software, rather than through a single shared database account. The shared account holds only the most basic permissions and cannot see any data under access control.

Domain

In KunlunBase and PostgreSQL, a Domain corresponds to the concept of a domain in relational algebra; it is essentially a column type. A domain carries the column's data type, default value, and column constraints (including nullability and CHECK constraints). After defining a domain with the CREATE DOMAIN command, it can be used as the column type of any table where it fits the business logic. You can use ALTER DOMAIN to modify a domain's definition, although its data type cannot be changed; meaningful modifications mainly involve default values and constraints. After such a modification, the corresponding columns of all tables that use the domain are modified as well.
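
For example (the domain, table, and column names below are hypothetical), a domain for a percentage value might be defined and later modified like this:

```sql
-- A reusable column type: an integer percentage from 0 to 100.
CREATE DOMAIN percent AS int
    DEFAULT 0
    CHECK (VALUE >= 0 AND VALUE <= 100);

CREATE TABLE exam_result (
    student_id bigint PRIMARY KEY,
    score      percent NOT NULL    -- the domain is used as the column type
);

-- Changing the domain affects every table column that uses it.
ALTER DOMAIN percent SET DEFAULT 100;
```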

Constraints

A table's regular data validity constraints include common rules such as data type, primary key, uniqueness, and nullability. They standardize data validity to a certain extent, but they cannot express the legal value ranges of individual fields or the constraint relationships among several fields of a row. Only CHECK constraints can do that, which makes CHECK constraints extremely important for correctly implementing an application system's business logic.

Data validity verification rules must be defined in the data source, which is the database system, instead of relying on the application software layer to perform such checks and verifications. Otherwise, illegal data will inevitably enter the database system in practice, thus affecting the normal operation of the application system. This is because as application software continues to iterate and expand, those data validity rules may have been strictly followed in the initial system design, but may have been omitted later due to changes in development personnel and lack of technical documentation.

Illegal data is like poison and pollutants, which can contaminate valid data, affect the stability of the application system, and even mislead data analysis decisions. Once illegal data enters the database, the existing business logic may exhibit unpredictable abnormal behaviors, affecting the stability of the system or even causing it to not work correctly. This is because the modules that use this data process it according to business needs and do not expect these illegal data. Therefore, the data validity verification mechanism of the database is like a filter, filtering out toxic, illegal, and incorrect data, ensuring that the data stored in the database are meaningful and legal.

CHECK Constraints

Defining CHECK constraints in the CREATE TABLE statement checks the validity of one or more fields of each row during every INSERT and UPDATE. An INSERT or UPDATE statement that violates the table's validity rules fails with an error and is rolled back, which keeps illegal data out of the database.
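
A brief sketch (the table and column names are hypothetical):

```sql
CREATE TABLE product_order (
    order_id   bigint PRIMARY KEY,
    quantity   int NOT NULL CHECK (quantity > 0),            -- single-column rule
    unit_price numeric(10,2) NOT NULL CHECK (unit_price >= 0),
    discount   numeric(10,2) NOT NULL DEFAULT 0,
    -- multi-column rule: the discount may never exceed the total price
    CHECK (discount <= quantity * unit_price)
);

-- This INSERT violates the multi-column CHECK (25 > 2 * 10) and is rolled back.
INSERT INTO product_order VALUES (1, 2, 10.00, 25.00);
```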

Foreign Key

In the classic database era, foreign keys were a common constraint, but the referential integrity rules of foreign keys have a significant impact on the performance of INSERT, UPDATE, DELETE operations. In distributed database systems, this performance overhead is even greater - the row in table t1 that references table t2 may be on another node, making referential integrity checks extremely expensive, so KunlunBase does not support foreign keys. This is the only commonly used classic SQL advanced feature that KunlunBase does not support.

Foreign keys were originally necessary because, when Edgar Codd designed the relational model, the query language was intended to be used by people operating directly on the data in the database, just as, in that era, using a computer meant typing commands into it. That made errors from human missteps easy to trigger. In today's application systems, however, the database runs as a backend server of the application software: SQL is written by application developers or auto-generated by ORM middleware and then sent to the database by the application, which removes human error from the production environment; such errors are caught during development and debugging.

At the same time, the application's operation logic can also correctly implement the referential integrity rules that meet the requirements. If a row in table t2 is to be deleted, the rows in table t1 that reference it can be explicitly deleted by the application software, or preserved, or set to NULL or other values. This is safer, more flexible, and can prevent data loss caused by cascade deletion.
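
As a sketch, reusing the t1 and t2 tables from the text (with a hypothetical column t1.t2_id referencing t2.id), the application can make its chosen rule explicit within one transaction:

```sql
BEGIN;
-- Option 1: explicitly delete the rows in t1 that reference the t2 row ...
DELETE FROM t1 WHERE t2_id = 42;
-- ... or Option 2: keep them but clear the reference instead:
-- UPDATE t1 SET t2_id = NULL WHERE t2_id = 42;
DELETE FROM t2 WHERE id = 42;
COMMIT;
```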

Special Advice for MySQL Users --- Use the Appropriate Data Types

MySQL did not enforce CHECK constraints until version 8.0.16; before that, user-defined CHECK constraints were silently ignored by the parser. The lack of CHECK constraints long posed great risks to MySQL application development: relying on application software for data validity checks makes it easy to bypass validity rules, unintentionally or even intentionally. MySQL users are therefore strongly advised to develop the habit of setting validity check rules in the CREATE TABLE statement.

MySQL supports very flexible data type conversions. This flexibility is also a curse: it causes some novice MySQL users to lose their grasp of data types, defeating or misusing the constraints, validity checks, comparison semantics, and other functions that data types provide.

The data type itself is the most basic data validity constraint. It specifies the range of legal data and the operations that can be performed, especially the methods for comparison and data sorting. Not all data types can be converted between each other. For example, converting dates, times, and timestamps into numeric values generally doesn't make much sense, but MySQL supports this, and its conversion method is imaginative --- for example, the date value '2023-07-16' is interchangeable with the integer 20230716.
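
For instance, MySQL accepts statements like the following (a sketch; exact behavior depends on the MySQL version and sql_mode settings):

```sql
-- MySQL happily converts between dates and integers.
SELECT DATE '2023-07-16' + 0;     -- returns 20230716
SELECT CAST(20230716 AS DATE);    -- returns '2023-07-16'
```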

Such arbitrary type conversion not only lets unexpected illegal values slip into data tables, but can also make indexes behave improperly or unexpectedly: values that are in the table may not be found (common in range searches), or an index that should be used is not actually used.

For example, some users tend to define all numeric columns of a table as strings. The problem is that a range search then returns wrong results, because the data is compared as strings. You can avoid this by left-padding the string representation with zeros, but then you either risk numeric overflow because the predetermined width is too small, or you occupy many times the space of a numeric type. In addition, the CPU overhead of string comparison is much greater than that of numeric or decimal comparison.
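
A sketch of the pitfall (hypothetical table and data):

```sql
CREATE TABLE goods (id int PRIMARY KEY, price_str varchar(16));
INSERT INTO goods VALUES (1, '9'), (2, '12'), (3, '100');

-- Intended: prices between 10 and 99. Compared as strings,
-- '9' >= '10' and '100' <= '99' both hold in string order,
-- so all three rows are returned instead of just '12'.
SELECT * FROM goods WHERE price_str BETWEEN '10' AND '99';
```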

Some users tend to define timestamp columns as string or datetime types, which discards the timezone information of the timestamp type and makes it impossible to display local time for users in different time zones. Moreover, storing any date, time, datetime, or timestamp column as a string type leads to incorrect range searches, inconsistent display formats, broken localization (l10n) and internationalization (i18n), illegal values entering the table, and storage consumption much larger than the correct type would need.

Finally, consider the charset and collation properties of string types. MySQL supports specifying charset and collation at the table and column level. This is very flexible, but it is also a huge pitfall: in practice, many users run into trouble because two columns' charsets or collations do not match. For example, t1.a and t2.b are both varchar(64) with UNIQUE constraints, but their collations differ, and a query contains a condition like WHERE t1.a = t2.b AND t2.b='xxx'. EXPLAIN then shows a full table scan on t1 instead of a lookup on t1.a's unique index, resulting in poor performance. Collation conversion itself is another performance problem: if a statement converts the collation of a large number of row fields, the overhead is considerable.
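
A sketch of that scenario in MySQL syntax (hypothetical tables; the exact plan depends on the MySQL version):

```sql
CREATE TABLE t1 (a varchar(64) UNIQUE) CHARSET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE TABLE t2 (b varchar(64) UNIQUE) CHARSET utf8mb4 COLLATE utf8mb4_bin;

-- Mixing a _ci and a _bin collation forces a collation conversion on t1.a,
-- so the optimizer cannot use t1's unique index and scans the whole table.
EXPLAIN SELECT * FROM t1, t2 WHERE t1.a = t2.b AND t2.b = 'xxx';
```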

Therefore, it is best to use UTF8MB4 uniformly within a database. The UTF8MB4 character set covers the characters of all human languages as well as emoji; it is comprehensive, and there is no need to use other character sets. Some users in China are used to GB2312/GB18030/GBK and similar character sets, but these do not include many foreign characters. If your business one day expands abroad, you will find that data entered by overseas users cannot be used correctly in your system. Hence the recommendation to use UTF8MB4 uniformly.

Views and Materialized Views

Views are a very useful advanced SQL feature. They separate the details of how data is stored across multiple tables from the data's logical meaning and application scenarios, functioning like a data interface. If the storage structure of a data table changes, you only need to modify the view definition accordingly, and the SQL at the application layer usually does not need to change. For application developers, querying views is more intuitive: the multi-table joins and projection expressions are already in the view definition, so querying a view usually only requires adding a filter condition.

KunlunBase supports defining access control rules for views. That is, data access control rules are defined according to the application scenarios and logical meaning of the data, which are more intuitive and precise. Therefore, a well-designed application system should have a view layer, and application developers almost always query a certain view to obtain data, rarely needing to directly query base tables.
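
A brief sketch of a view acting as a data interface with its own access control rules (the table, view, and role names are hypothetical):

```sql
-- The view hides the join details and exposes only non-sensitive columns.
CREATE VIEW v_order_summary AS
    SELECT o.order_id, c.customer_name, o.total_amount
    FROM   orders o JOIN customers c ON o.customer_id = c.id;

-- Application roles query the view; they get no access to the base tables.
GRANT SELECT ON v_order_summary TO app_reader;

-- A typical application query: just the view plus a filter condition.
SELECT * FROM v_order_summary WHERE customer_name = 'ACME';
```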

Materialized views build on views by storing the result of the view's query in a data table. If a view is queried repeatedly, this avoids repeatedly executing the (usually complex) query in the view definition, yielding better query performance. However, the results cached in a materialized view grow stale as the underlying base tables are updated, so after creating a materialized view you need to periodically execute the REFRESH MATERIALIZED VIEW statement to update the cached data.
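
A sketch of this workflow (reusing the hypothetical orders table from above):

```sql
-- Cache the (typically expensive) aggregation result in a table.
CREATE MATERIALIZED VIEW mv_daily_sales AS
    SELECT order_date, sum(total_amount) AS revenue
    FROM   orders
    GROUP  BY order_date;

-- Run periodically (e.g. from a scheduled job) to pick up base-table updates.
REFRESH MATERIALIZED VIEW mv_daily_sales;
```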

In KunlunBase, the data of a materialized view is stored on the storage nodes. Replaying a CREATE MATERIALIZED VIEW statement on a computing node does not recompute the data, and the REFRESH MATERIALIZED VIEW statement does not enter the DDL log. In other words, replaying DDL on a computing node will not repeatedly refresh the materialized view and cause performance problems.

Triggers

A trigger lets users define custom operations to be performed, before or after an insert, update, or delete, for each affected row or for rows meeting a filter condition (row-level triggers), as well as operations to be performed before or after an insert, update, or delete statement executes against a table (statement-level triggers).

So why not perform these operations in application code? For row-level triggers, it is genuinely impossible to capture this timing precisely outside the database, unless the entire data update logic is implemented as a stored procedure that scans qualifying rows with a cursor and performs the custom operations before and after each modification. Row-level triggers therefore do have real value, and KunlunBase supports them.

For statement-level triggers, these can indeed be performed in the application code by executing them before or after within the same transaction. However, considering the maintainability and lifecycle of application software, after some years many applications become hard to maintain. So if you need to add these custom operations but can't modify the application code, adding statement-level triggers to extend the behavior of statements is a good way to remedy the situation and extend the life of the application. DBAs can do this without relying on the version cycle of the application software supplier, such as adding some auditing operations.
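
A minimal sketch of such an audit extension, using standard PostgreSQL trigger syntax (the audit table, function, and trigger names are hypothetical, and orders is the hypothetical table from earlier examples):

```sql
-- Hypothetical audit table recording who ran which kind of statement and when.
CREATE TABLE audit_log (who text, what text, at_time timestamptz);

CREATE FUNCTION log_orders_change() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log VALUES (current_user, TG_OP, now());
    RETURN NULL;   -- the return value is ignored for statement-level triggers
END;
$$ LANGUAGE plpgsql;

-- Statement-level trigger: fires once per statement, not once per row.
CREATE TRIGGER orders_audit
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH STATEMENT EXECUTE FUNCTION log_orders_change();
```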

Performance Overhead

Both row-level and statement-level triggers increase the work of query processing and may inadvertently and significantly degrade the performance of query statements. In the era of traditional databases, database systems could not scale horizontally, making this impact more severe and harder to resolve: the only remedy was a more powerful and more expensive server. With KunlunBase, a distributed database, you can add computing nodes as needed to increase computing capacity, so the performance overhead and pressure brought by triggers and stored procedures can be resolved by adding computing nodes.

Alternatives to Row-Level Triggers

For some requirements, row-level triggers can be replaced by processing the stream of data update events after the fact. For MySQL, this means processing the binlog stream; KunlunBase's CDC (Change Data Capture) function can output a data change stream for external components to consume. This approach is quite common today and can take advantage of the related toolchains.

Stored Procedures

Stored procedures allow users to define a set of operations that can then be invoked, with benefits similar to defining and calling functions in a programming language. Stored procedures avoid network latency and bandwidth consumption between the database server and the application server, since the data is operated on entirely within the database cluster, which can sometimes yield good performance.

Stored procedures can implement better and more flexible data access control. After sensitive data has been calculated and processed by a stored procedure, the caller only gets the final processed result, ensuring that sensitive data does not leave the database. You can define user and role execution permissions for stored procedures to ensure authorized operations.
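
A sketch of this pattern, reusing the hypothetical employee table and role names from the RLS example: the function runs with its owner's privileges (SECURITY DEFINER), so callers receive only the computed result and cannot read the sensitive table directly.

```sql
-- Runs with the privileges of its owner, not of the caller.
CREATE FUNCTION avg_salary_by_dept(p_dept int) RETURNS numeric AS $$
    SELECT avg(salary) FROM employee WHERE dept_id = p_dept;
$$ LANGUAGE sql SECURITY DEFINER;

-- Callers may execute the function but cannot query employee itself.
REVOKE ALL ON employee FROM analyst_role;
REVOKE EXECUTE ON FUNCTION avg_salary_by_dept(int) FROM PUBLIC;
GRANT  EXECUTE ON FUNCTION avg_salary_by_dept(int) TO analyst_role;
```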

However, if a stored procedure carries a heavy computational load, its execution performance can be a concern compared with today's commonly used programming languages; the computation time may exceed the data transfer time it saves, resulting in lower overall performance. Moreover, debugging stored procedures is complicated, as there is no debugger to help, which means higher development and maintenance costs.

Stored procedures can also be very effective for maintaining old systems. For functionality implemented through stored procedures, end users can, in principle, partially update the application by updating the stored procedures and related database objects such as tables and triggers, even when the application developer cannot quickly update the application system.

Stored procedures and triggers share similar issues of performance overhead and scalability. However, the solution for KunlunBase is the same --- just add more computing nodes.

KunlunBase inherits PostgreSQL's powerful stored procedure functionality, supporting not only PL/pgSQL but also languages such as Python and Perl for writing stored procedures. In addition, the PostgreSQL community provides stored procedure plugins for languages such as Lua, Java, and JavaScript; once installed, these languages can be used to write stored procedures. Their execution efficiency is higher than PL/pgSQL's, avoiding its performance weakness, and they come with rich function and class libraries, such as Python's machine learning and data analysis libraries. This allows KunlunBase's computing nodes to be used as data processing nodes, offering at least two advantages: data analysis and privacy-preserving computation can be performed without the data leaving KunlunBase, and data analysis and processing capacity can be increased by adding computing nodes as needed, achieving horizontal scalability of computational capacity.
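
For instance, assuming the PL/Python plugin (plpython3u) is installed on the computing nodes, a data-processing function might look like this sketch (the function name is hypothetical, and orders is the hypothetical table from earlier examples):

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

-- Compute a simple statistic in Python without the data leaving the cluster.
CREATE FUNCTION median_of(vals float8[]) RETURNS float8 AS $$
    import statistics
    return statistics.median(vals)
$$ LANGUAGE plpython3u;

SELECT median_of(array_agg(total_amount)) FROM orders;
```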

Summary

Of the advanced SQL features discussed in this article, we believe that, apart from foreign keys, all have real practical value, especially multi-level, multi-granularity access control, the various constraints other than foreign keys, and features such as views and materialized views. They are very valuable for designing and implementing excellent application software systems.

In particular, it is important to use data types accurately and to give serious thought to nullability and uniqueness, two points that are often overlooked. Changing a column's nullability or data type with ALTER TABLE requires a full copy of the table's data, even when the base type is merely widened (e.g., int to bigint). It is best to define CHECK constraints at table creation time; for architects who understand the application requirements and business logic, this is achievable. Adding them later means dealing with illegal data already in the table, and such remedial efforts are often abandoned because of the maintenance difficulties, which in turn hurts the stability, usability, and maintainability of the application system.

We recommend using stored procedures for data analysis scenarios. Doing so leverages the rich data analysis and machine learning libraries of languages like Python, enabling rapid development and iteration of such functionality while completing analysis and computation without the data leaving the database. Moreover, users can take advantage of the horizontal scalability of KunlunBase's computing nodes, adding nodes as needed so that analysis and computation tasks do not hit performance bottlenecks. DBAs can allocate a separate set of computing nodes for data analysts so that data analysis does not impact transaction processing workloads.

The use of triggers and stored procedures should be determined judiciously. In many cases, these two features might not be the optimal technical path to implement related functional logic anymore. They can be used more as a workaround for compatibility with old application systems.
