Distributed database computing involves requirements such as distributed transactions, data distribution, and data convergence calculations.
Distributed databases can achieve high security, high performance, and high availability, but of course they also bring high costs (fixed costs and operating costs). We use MongoDB and MySQL
Cluster analyzes the design ideas from the implementation point of view to abstract some of the design methods that we can refer to when designing the database and apply them to our production system.
First talk about the characteristics of relational and non-relational databases:
MySQL’s Innodb and Cluster have complete ACID properties:
A Atomicity The entire transaction will either be completed or rolled back as a whole.
B Consistency Before the transaction starts and after the transaction ends, the integrity restrictions of the database are not violated.
C Isolation The execution of two transactions does not interfere with each other, and the time of two transactions will not affect each other.
D Persistence After the transaction is completed, the changes made by the transaction to the database are persistently saved in the database and are complete.
In order to realize ACID, implementations such as Undo, Redo, MVCC, TAS, signals, two-phase blockade, two-phase commit, and blockade are introduced, and data access paths are introduced, and the whole thing becomes extremely complicated. When MySQL follows the SQL standard and uses the SQL standard, it can achieve seamless migration between RDBMS. Its rich data types, complete business logic control and expression ability have always been the first choice for commercial applications. MongoDB uses collections to represent data, does not have ACID properties, and is generally recognized for its typelessness, rapid deployment, and rapid development. Both RDBMS and MongoDB use an index structure. MongoDB supports B-tree indexes. Indexes are built according to user needs and can be nested between containers at various levels.
In the database, there are two data storage methods:
1. Heap: Data is always piled up at the end of the file according to the method of backward insertion. When using the index structure to access data, the data pointer will be obtained in the index, and then the data will be obtained. When data is deleted, it will be deleted from the corresponding position. Frequently updated heap tables need to be optimized regularly. Using heap tables will break the principle of data sequential access (access optimization is done in the DBMS, and the performance is partially improved). Since there is no fill factor, under the same compression algorithm, Space can be greatly saved, and heap tables are very suitable for sequential range access, such as business scenarios such as data warehouses.
2. Index organization: The general index organization table uses B+ as the construction method. The whole structure is like an upside-down tree (from the perspective of data access flow), routing information is stored on the branches, and all data is stored in leaf nodes. All the leaves are connected in series according to the order. Due to the limitation of time and space access, this can greatly improve data performance. The DBMS accesses and constructs data according to the access path. The depth of the access path directly affects the performance. It is generally recommended to control the access path within 4 ( Less than or equal to 3), the reason is that the cost of accessing multi-layer paths is higher and the cost of maintaining index trees is becoming more and more expensive.
Our common Innodb, MySQL
Cluster and others are index-organized tables, MyISAM is a heap table, and the organizational structure of MongoDB is a heap.
Databases with AICD attributes have index maintenance functions. MyISAM storage engine and MongoDB have a heap organization structure without ACID control, which will lead to inconsistency between metadata and indexes, directly leading to data access failure and data inconsistency. ACID requirements, update (the update described in this article includes all write operations) speed will be greatly improved, MyISAM storage engine needs regular consistency check, precisely because it does not have ACID properties, MyISAM storage engine needs to be Data updates lock the table, resulting in low performance of updates under large concurrency
MySQL Cluster Architecture
Cluster is divided into SQL nodes, data nodes, management nodes (MySQL
Cluster provides an API for internal calls, and external applications can access any layer method through the API interface).
The SQL node provides functions such as user SQL command request, parsing, connection management, query optimization and response, cache management, data merge, sort, clipping, etc. When the SQL node starts, it will synchronize the architecture information to the management node for data query routing .
Data nodes provide functions such as data access, persistence, and API data access.
The management node maintains node activity information, and implements data backup and recovery. The management node will obtain the status and error information of the nodes in the entire cluster environment, and feed back the information of each node in each cluster to all other nodes in the entire cluster. This is very important for the data routing rules of SQL nodes. When the node expands , the data will be rebuilt.
Data nodes use sharding and multiple data storage, at least two copies are stored, the data is stored in memory, and persisted according to the rules of the management node. As a data access location, a large amount of memory support is required.
As the query entry point, the SQL node consumes a large amount ofFor cpu and memory resources, distributed management nodes can be used, and a layer of request distribution and HA control mechanism can be encapsulated outside the SQL node to solve single point and performance problems, and it provides linear expansion function.
The management node maintains global rule information, and when a node fails, a failure notification will occur. In the entire Cluster system, any component supports dynamic expansion and linear expansion, providing a high-availability and high-performance solution.
question:
When adding new data nodes, it is necessary to reconstruct the access path information, which will cause data reconstruction pressure on the management node. It is recommended to perform this operation during non-peak business hours.
Cluster uses automatic key value identification data sharding scheme, users do not need to care about the data slicing scheme (partition key rules are provided in 5.1 and later), transparently implement distributed databases, data sharding rules are based on 1, primary key, 2 unique index, 3 automatic The row ID is completed, and then the number of clusters is distributed. Its access data is the same as the RAID access mechanism. It can extract data from each node in parallel and hash the data. When accessing with a non-primary key or partition key, all cluster nodes will be scanned. Affect performance (this is the core challenge faced by Cluster).
MongoDB replica set architecture, based on MongoDB replication, constructs a distributed database solution:
MongoDB provides a structure similar to MySQL Cluster. In mongod, mongos, and mongo, it includes:
Mongod: Data access interface, which distributes requests to Mongos nodes.
Mongs: functions such as data access routing, query optimization, data merge, sort, and pruning.
mongo: data access (using the mongo protocol also provides direct data access).
When MongoDB builds a collection, it needs to provide data fragmentation rules, which will be recorded in mongos. The query request mongod will initiate a request to mongos, and mongos will access data in mongo according to the access path.
Since MongoDB provides users with a choice on how to slice data, it can quickly access data in the case of transparent access to users.
Problems faced by MongoDB:
When accessing data with non-fragmentation rules (indexes can be built on each shard), it will result in a full scan of all Mongo cluster nodes (which can be achieved by multiple redundant copies and different sharding rules, which is also the current application of data sharding commonly used means).
When a new data cluster is added, all data nodes will be reconstructed, directly affecting performance.
Summary:
MongoDB uses the heap access path method to organize data, does not include ACID features, and has a great improvement in updating and querying large amounts of data (for architectures with MVCC, it will reduce the response speed in high concurrency and large data sets), but it does not have ACID Ensure the stability and security of key data
MongoDB solves the automatic sharding rules of MySQL Cluster (user-defined functions are provided after 5.1), and MySQL Cluster
Cluster. The SQL node data processing work is handed over to mongos, MySQL
Cluster uses the access path of SQL->node->SQL, and MongoDB uses the access path of Mongod->Mongos->node->Mongos->Mongod. From the architectural point of view, MySQL
Cluster is simpler and more efficient.
MySQL
Cluster has complete commercial support and common standard support, relatively rich management tools, MongoDB has relative performance advantages, but lacks strong stability and security support, rich management tools, both have their own advantages, but they have almost the same Achilles’ heel.
MySQL
Cluster can implement a topology based on replication, and synchronize data to different places without changing the internal topology to form a star topology. MongoDB still lacks relevant technical solutions in this regard (of course it can be a replication solution, but MySQL
Cluster is implemented at a higher level, and MongoDB is implemented at a lower level, which will face great challenges for management).
From a commercial point of view, MySQL Cluster has sufficient commercial value, but its flaws are also obvious. MongoDB’s impact on MySQL
The improvement of Cluster is worth thinking about and introduced in daily data architecture design and schema design, but as a large-scale commercial application, MySQL
Both Cluster and MongoDB still have a long way to go, whether it is inherent defects or management models.