Editor’s note: The High-Availability Architecture group shares and disseminates representative articles in the field of architecture. This article was shared by Yao Sifang in the High-Availability Architecture group. When reprinting, please credit the High-Availability Architecture public account “ArchNotes”.
Yao Sifang is a senior technical expert at Sina Weibo and technical lead of the Weibo platform architecture group. He joined Sina Weibo in 2012 and took part in several key projects, including the Weibo Feed architecture upgrade, the platform service-oriented transformation, and the hybrid cloud. He is currently responsible for research and development of the platform’s shared infrastructure. He presented “Sina Weibo High Performance Architecture” at QCon and focuses on high-performance architecture and service middleware.
Business background and problems with Nginx
Thanks to its very high performance and stability, Nginx is widely used in the industry, and it is used extensively at Layer 7 at Weibo. Combined with the Nginx health check module and the dynamic reload mechanism, service upgrades and capacity expansion can be almost lossless. In that setup, expansion is relatively infrequent and in most cases planned.
Weibo’s business has very pronounced traffic peaks. There are routine evening peaks, expected extreme peaks such as New Year’s Day, the Spring Festival Gala, and Red Envelope Flying, and occasional peaks triggered by celebrity or social events such as #周周见# and #我们#. The usual approach in the past was buffer plus downgrade. If downgrading is ruled out (it hurts the user experience), a small buffer cannot absorb the peak, while a buffer large enough for the peak costs too much. Therefore, since 2014 we have been using containerization to adjust the buffer dynamically, expanding and shrinking it on demand according to traffic in order to save costs.
In this scenario there is a large number of continuous expansion and contraction operations. The industry commonly uses two solutions for changing backends behind Nginx: the DNS-based approach provided by Tengine, and backend service discovery based on consul-template. The characteristics of the two schemes are compared briefly below.
Based on DNS: This module is developed by the Tengine team and dynamically resolves domain names configured in the upstream conf. It is easy to operate: just modify the list of servers attached to the DNS name.
Shortcomings:
- DNS is resolved by periodic polling (30s). If the configured interval is too short, such as 1s, it puts pressure on the DNS server. If it is too long, changes take effect too slowly.
- Do not attach too many servers to a single DNS name. The response will be truncated (UDP protocol), and it also puts pressure on bandwidth.
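To make the truncation concern concrete, here is a rough Python estimate of how many A records fit in a single response. It is only a sketch under stated assumptions: the classic 512-byte limit for DNS over UDP (no EDNS0), IPv4 answers only, and every answer using a 2-byte compression pointer for its name. The domain name and function names are made up for illustration.

```python
# Rough estimate of how many A records fit in one classic UDP DNS response.
# Assumptions: 512-byte limit (plain DNS over UDP, no EDNS0), IPv4 only,
# and each answer refers to the question name via a 2-byte pointer.

UDP_LIMIT = 512   # bytes, classic DNS-over-UDP message size limit
HEADER_SIZE = 12  # fixed DNS header
# name pointer + type + class + TTL + rdlength + IPv4 address
A_RECORD_SIZE = 2 + 2 + 2 + 4 + 2 + 4


def encoded_name_length(name: str) -> int:
    """Length of a domain name in wire format (length-prefixed labels + root)."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def max_a_records(name: str, limit: int = UDP_LIMIT) -> int:
    """Upper bound on A records that fit before the response is truncated."""
    question_size = encoded_name_length(name) + 4  # plus qtype and qclass
    remaining = limit - HEADER_SIZE - question_size
    return max(remaining // A_RECORD_SIZE, 0)
```

Under these assumptions, `max_a_records("api.example.com")` gives 29, so a pool of a few dozen backends already exceeds what one un-truncated UDP response can carry.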
Based on consul-template and consul: the two are used together. Consul serves as the database, and consul-template is deployed on the Nginx server. Consul-template periodically queries consul; if a value has changed, it updates the local Nginx configuration files and issues a reload command. Under heavy traffic, however, a reload affects performance. A reload creates new worker processes, and for a period of time the old and new workers coexist. The old workers repeatedly traverse their connection lists to check whether their requests have finished, and exit once everything is done. A reload also closes the keep-alive connections between Nginx and its clients and backends, so the new workers have to establish new connections.
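The poll, rewrite, and reload cycle described above can be sketched in Python as follows. This is not consul-template’s implementation. The function names are hypothetical, and the fetch, write, and reload steps are passed in as callables so the sketch stays self-contained; in a real deployment the reload step would be something like running `nginx -s reload`.

```python
from typing import Callable, Iterable, Optional


def render_upstream(name: str, servers: Iterable[str]) -> str:
    """Render an Nginx upstream block from "ip:port" strings."""
    lines = [f"upstream {name} {{"]
    for server in sorted(set(servers)):
        lines.append(f"    server {server};")
    lines.append("}")
    return "\n".join(lines) + "\n"


def sync_once(
    name: str,
    fetch_servers: Callable[[], Iterable[str]],
    previous_config: Optional[str],
    write_config: Callable[[str], None],
    reload_nginx: Callable[[], None],
) -> str:
    """Run one polling cycle; rewrite the config and reload only on change."""
    config = render_upstream(name, fetch_servers())
    if config != previous_config:
        write_config(config)
        # The costly step: new workers start, old workers drain and exit,
        # and keep-alive connections are re-established.
        reload_nginx()
    return config
```

Every real change to the backend list costs one reload in this scheme, which is why frequent expansion and contraction makes the overhead add up.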
Performance impact caused by reload:
[Figure: performance impact caused by reload; image not recoverable]
[…] the impact on performance is limited and negligible.
Applications
The module has been applied in various Weibo businesses. The chart below compares QPS and latency before and after using the module.
The data shows that a reload reduces Nginx’s request processing capacity by about 10% and increases the time spent in Nginx itself by more than 50%. If capacity is expanded frequently, the overhead of reloading becomes even more noticeable.
During the 2016 New Year’s Day period, hundreds of expansion and contraction operations were carried out according to the traffic characteristics of different time periods, and the SLA of the overall service was not affected during the process.
The official commercial version, Nginx Plus, supports DNS-based and push-based updates.
Because of data consistency and other issues encountered in use, this module supports a consul-based pull model.
The module is available at https://github.com/weibocom/nginx-upsync-module, where the wiki and documentation are currently being improved.
Q & A
1. Is the registration of machine configuration information in consul automatically adjusted by the Weibo system according to the traffic?
These are two separate issues. Registering backend information with the nodes during capacity expansion is automatic and has been integrated into the online system. In addition, Weibo is developing and evaluating an online capacity evaluation system; for now, adjustments are semi-automatic.
2. May I ask why you didn’t consider zk at the beginning? If you switched to zk and replaced polling pulls with a long-lived connection, would there be any difference?
The module is already adding support for backends such as etcd and zk. Consul was used at the beginning because the company already had consul clusters and operations staff for them. For the module, zk is essentially the same as etcd and consul.
3. Why not have the Nginx master pull and then distribute to each worker, instead of having the worker processes pull? The former would reduce network interaction and improve consistency among the workers inside one Nginx instance.
Pulling from the master would require modifying the core module. A major design principle of the module is to keep it free of dependencies as far as possible. Overall, this is a trade-off too.
4. Is the registration of machine configuration information in consul automatically adjusted by the Weibo system according to the traffic?
This question is similar to question 1.
5. Based on what considerations did you choose consul for configuration management?
Similar to question 2.
6. Could you design a set of APIs for Nginx to pull from, so that it doesn’t matter whether the source is consul or a Java service? That feels more general.
The module’s design is similar to what you describe. It defines an upsync type, and different types can be implemented for different sources.
7. In addition, I don’t know whether it solves the problem of too much Nginx routing information taking up too much memory. We currently use LRU eviction. Does Weibo have such a scenario?
This was considered during design. Normally we keep only the current routing table and one expired routing table. Once all the requests served by the expired routing table have been processed, its memory is released.
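The idea in the answer to question 7 can be illustrated with a small reference-counting sketch in Python. The module itself is an Nginx module and this is not its code. The class and method names are hypothetical, and keeping expired tables in a dictionary keyed by version is a simplification for cases where several updates arrive before older requests finish.

```python
from typing import Dict, List


class RoutingTable:
    """One immutable version of the backend list plus an in-flight counter."""

    def __init__(self, version: int, servers: List[str]) -> None:
        self.version = version
        self.servers = list(servers)
        self.in_flight = 0  # requests still using this table


class RoutingTables:
    """Keep the current table and any expired table still serving requests."""

    def __init__(self, servers: List[str]) -> None:
        self.current = RoutingTable(1, servers)
        self.expired: Dict[int, RoutingTable] = {}

    def acquire(self) -> RoutingTable:
        """Called when a request starts; it pins the current table."""
        self.current.in_flight += 1
        return self.current

    def release(self, table: RoutingTable) -> None:
        """Called when a request ends; drop an expired table once it is idle."""
        table.in_flight -= 1
        if table is not self.current and table.in_flight == 0:
            self.expired.pop(table.version, None)

    def update(self, servers: List[str]) -> None:
        """Install a new table; keep the old one only while it has requests."""
        old = self.current
        self.current = RoutingTable(old.version + 1, servers)
        if old.in_flight > 0:
            self.expired[old.version] = old
```

New requests always see the newest table, while requests that started earlier finish against the table they began with, so memory held by old tables stays bounded by the number of in-flight requests rather than by the number of updates.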
8. When traffic is low, what are Weibo’s idle machines used for? If these machines are being used by other services, what happens to those services when Weibo dynamically brings the machines back in?
We are currently deploying a hybrid cloud, and the buffer pool is created on the public cloud. When traffic is low, we simply release it. Machines in our own data centers can usually run some offline services when traffic is low. The strategy for mixing online and offline workloads is currently under development.
9. Actual tests with ab and wrk show that frequently updating the consul list under high load can fail. How do you deal with this problem?
We have tested changing thousands of machines per second without problems. This is enough to support most expansion demands (including predators). When we stress-tested consul, consul itself did fail (a single master serves requests), which has nothing to do with the module. The performance and configuration of the consul cluster need to be improved.
10. I would like a basic understanding of consul. Is the routing information in consul stored in memory? How is the routing information of different services distinguished? Is it only by key naming, or is consul deployed separately for each service?
Routing table information is stored in three places. 1. In Nginx memory, which serves routing directly. 2. On consul, where each backend node is stored as a key of the form /$consul_path/$upstream/ip:port. 3. In a file on the server where Nginx runs, stored as a snapshot to guard against consul and Nginx being down at the same time.
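As an illustration of how the key layout /$consul_path/$upstream/ip:port separates services, the following Python sketch groups a flat list of keys by upstream name. The function name and sample keys are hypothetical, and the parsing only assumes the key format quoted above; it does not call consul.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_backends(keys: Iterable[str], consul_path: str) -> Dict[str, List[str]]:
    """Group keys shaped like /<consul_path>/<upstream>/<ip:port> by upstream."""
    prefix = consul_path.strip("/") + "/"
    upstreams = defaultdict(list)
    for key in keys:
        normalized = key.lstrip("/")
        if not normalized.startswith(prefix):
            continue  # key belongs to some other path
        upstream, sep, backend = normalized[len(prefix):].partition("/")
        if sep and upstream and backend:
            upstreams[upstream].append(backend)
    return {name: sorted(backends) for name, backends in upstreams.items()}
```

For example, the keys `/upstreams/feed/10.0.0.1:8080` and `/upstreams/user/10.0.1.1:9000` under the path `upstreams` end up in two separate upstreams, `feed` and `user`, so one consul deployment can hold the routing information of several services.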