Background
UAE (UC App Engine) is a PaaS platform inside UC, broadly similar in structure to CloudFoundry. It provides:
- Rapid deployment: supports Node.js, Play!, PHP, and other frameworks
- Information transparency: visibility into the operations process, system status, and business status
- Gray-release trial and error: gray release by IP and by region
- Basic services: key-value storage, highly available MySQL, an image platform, etc.
UAE itself is not the focus of this article, so it will not be introduced in detail.
Hundreds of web applications run on UAE, and every request is routed through it, producing terabytes of Nginx access logs each day. How do we monitor each business's traffic trends, advertising data, page latency, access quality, custom reports, and exception alarms?
Hadoop can meet the statistical requirements, but its latency falls well short of second-level real time; Spark Streaming felt like overkill, and we had no Spark engineering experience; writing our own distributed program makes scheduling cumbersome and forces us to handle scaling and message flow ourselves.
In the end our technology choice was Storm: relatively lightweight, flexible, convenient for message passing, and easy to scale.
In addition, since UC has many clusters, cross-cluster log transmission is also a fairly big problem.
Technical preparation
Cardinality Counting
In big-data distributed computing, PV (page views) can be trivially summed and merged, but UV (unique visitors) cannot.
In a distributed setting, hundreds of businesses and hundreds of thousands of URLs count UV simultaneously. If the statistics are also kept per period (per minute / merged every 5 minutes / merged hourly / merged daily), the memory consumption of exact counting is unacceptable.
This is where probability shines. As Probabilistic Data Structures for Web Analytics and Data Mining shows, the memory used by an exact hash-table UV count and by a cardinality counter are not even in the same order of magnitude. Cardinality counting lets you merge UVs with minimal memory consumption and an acceptable error.
Start with LogLog Counting: understanding why a uniform hash permits a rough estimate is enough; the formula derivation that follows can be skipped.
The specific algorithm we use is Adaptive Counting, via the computing library stream-2.7.0.jar.
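To make the "mergeable UV" idea concrete, here is a toy LogLog-style counter in stdlib Python. It is only a sketch of the principle (the production system uses Adaptive Counting from stream-2.7.0.jar, not this code): each item updates one of m registers with the position of the first 1-bit of its hash, and merging two counters is just an element-wise maximum of registers, which is why per-minute UV counters can be rolled up into hourly and daily ones.

```python
# Toy LogLog-style cardinality counter: illustrates why probabilistic UV
# counters merge cheaply while exact sets do not. Not the production
# Adaptive Counting implementation from stream-2.7.0.jar.
import hashlib

class LogLogCounter:
    def __init__(self, k=10):           # 2**k registers
        self.k = k
        self.m = 1 << k
        self.registers = [0] * self.m

    def offer(self, item):
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)          # low k bits pick a register
        rest = h >> self.k
        rank = 1                        # position of first 1-bit in the rest
        while rest & 1 == 0 and rank < 64:
            rest >>= 1
            rank += 1
        self.registers[idx] = max(self.registers[idx], rank)

    def cardinality(self):
        # LogLog estimate: alpha * m * 2**(mean register value)
        alpha = 0.39701                 # bias correction for large m
        mean = sum(self.registers) / self.m
        return int(alpha * self.m * 2 ** mean)

    def merge(self, other):
        # UV merge = element-wise max of registers; this is what makes
        # per-minute counters combinable into hourly/daily counters.
        merged = LogLogCounter(self.k)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged
```

With k=10 (1024 registers, 1 KB of state) the estimate for tens of thousands of distinct visitors typically lands within a few percent of the truth, which is the memory/accuracy trade-off the article relies on.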
Real-time log transmission
Real-time computing depends on second-level real-time log transmission. An added benefit is that it avoids the network congestion caused by periodic batch transmission.
Real-time log transmission uses a lightweight log transport tool that already existed in UAE. It is mature and stable, and we could use it directly; it consists of a client (mca) and a server (mcs).
The client watches for changes in each cluster's log files and ships them to the machines of the designated Storm cluster, where they are stored as ordinary log files.
We adjusted the transport strategy so that the log files on each Storm machine are roughly the same size, and each spout therefore only needs to read local data.
Data source queue
We did not use the queues commonly paired with Storm, such as Kafka or MetaQ, mainly because they are too heavyweight.
fqueue is a lightweight queue speaking the memcached protocol: it exposes ordinary log files as a memcached service, so Storm spouts can read records one by one directly over the memcached protocol.
This data source is relatively simple and does not support replay: once a record is fetched it is gone, so if a tuple fails to process or times out, that data is lost.
It is lightweight, reads from local files, and keeps only a thin cache layer; it is not a pure in-memory queue. Its performance bottleneck is disk I/O, and its per-second throughput matches the disk read speed. That is enough for our system, and there is a plan to switch to a pure memory queue in the future.
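Since fqueue speaks the standard memcached text protocol, a spout can pull records with a plain socket. The sketch below shows the shape of such a read; the host, port, and queue key name (`uae_log`) are assumptions for illustration, not fqueue's actual configuration, and note that these are destructive reads, matching the no-replay behavior described above.

```python
# Minimal sketch of pulling one log record from fqueue over the memcached
# text protocol. Host/port and the key name "uae_log" are assumptions.
import socket

def parse_get_response(raw: bytes):
    """Parse a memcached text-protocol GET response into the stored value,
    or None on a miss (a bare END line)."""
    if raw.startswith(b"END"):
        return None
    header, rest = raw.split(b"\r\n", 1)
    # header looks like: VALUE <key> <flags> <bytes>
    nbytes = int(header.split()[3].decode())
    return rest[:nbytes]

def fetch_one(host="127.0.0.1", port=18740, key="uae_log"):
    """Fetch one queued record; returns bytes, or None if the queue is empty.
    The read is destructive: a fetched record cannot be replayed."""
    with socket.create_connection((host, port), timeout=1.0) as s:
        s.sendall(b"get %s\r\n" % key.encode())
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = s.recv(4096)
            if not chunk:
                break
            buf += chunk
        return parse_get_response(buf)
```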
Architecture
With the above technical groundwork in place, we can obtain a user's log within a few seconds of the visit.
The overall architecture is also fairly simple. There are two computing bolts in order to keep the computation evenly distributed: business volumes vary enormously, and doing fieldsGrouping by business ID alone would leave the computing resources unbalanced.
- The spout normalizes each raw log line and distributes it to the corresponding stat_bolt, grouped by URL (fieldsGrouping, so that each server's computation load stays even);
- stat_bolt is the main computing bolt: it organizes and computes each business's URLs, producing PV, UV, total response time, backend response time, HTTP status code statistics, URL ranking, traffic statistics, and so on;
- merge_bolt merges each business's data, such as PV and UV counts; the UV merging, of course, uses the cardinality counting described above;
- We wrote a simple Coordinator class, with its streamId marked "coordinator". It handles time coordination (batch segmentation), checking task completion, and timeout handling; the principle is similar to Storm's built-in Transactional Topology.
- We implemented a Scheduler that fetches parameters through an API and dynamically adjusts the distribution of spouts and bolts across servers, allowing flexible allocation of server resources.
- Smooth topology upgrades are supported: during an upgrade, the new and old topologies run at the same time and coordinate the switchover moment; once the new topology takes over the fqueue, the old topology is killed.
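The two-stage split can be sketched as follows. This is an illustrative model, not the production bolts: fieldsGrouping by URL means the same URL always lands on the same stat_bolt task (hash of the field modulo task count), so per-URL counters never need cross-task merging, and merge_bolt only rolls per-URL results up to business totals. All field and class names here are hypothetical.

```python
# Sketch of the two-stage computation. Stage 1 (stat_bolt) accumulates
# per-URL stats inside one task; stage 2 (merge_bolt) sums them per business.
from collections import defaultdict

NUM_STAT_TASKS = 4

def stat_task_for(url: str) -> int:
    # The effect of Storm's fieldsGrouping: hash of the field mod task count,
    # so a given URL is always routed to the same stat_bolt task.
    return hash(url) % NUM_STAT_TASKS

class UrlStats:
    """Per-URL accumulator kept inside one stat_bolt task."""
    def __init__(self):
        self.pv = 0
        self.total_rt_ms = 0.0
        self.status_codes = defaultdict(int)

    def add(self, rt_ms: float, status: int):
        self.pv += 1
        self.total_rt_ms += rt_ms
        self.status_codes[status] += 1

def merge_business(per_url: dict) -> dict:
    """merge_bolt: roll per-URL stats up into one business-level summary.
    (UV would be merged with cardinality counters, omitted here.)"""
    out = {"pv": 0, "total_rt_ms": 0.0}
    for stats in per_url.values():
        out["pv"] += stats.pv
        out["total_rt_ms"] += stats.total_rt_ms
    return out
```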
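The Coordinator's time coordination can also be sketched briefly: tuples are assigned to fixed-width batches by timestamp, and a batch is considered finished once every task has reported completion or a timeout has expired (timed-out batches are flushed with whatever data arrived). The class shape, the 60-second batch width, and the 30-second timeout below are illustrative assumptions, not the production values.

```python
# Rough sketch of time coordination (batch segmentation), completion
# checking, and timeout handling, per the Coordinator described above.
import time

BATCH_SECONDS = 60

def batch_id(ts: float) -> int:
    """All events inside the same wall-clock minute share a batch id."""
    return int(ts) // BATCH_SECONDS

class Coordinator:
    def __init__(self, num_tasks: int, timeout: float = 30.0):
        self.num_tasks = num_tasks
        self.timeout = timeout
        self.done = {}      # batch id -> set of task ids that finished
        self.started = {}   # batch id -> wall-clock time first seen

    def report_done(self, batch: int, task_id: int, now: float = None):
        now = time.time() if now is None else now
        self.started.setdefault(batch, now)
        self.done.setdefault(batch, set()).add(task_id)

    def is_complete(self, batch: int, now: float = None) -> bool:
        """Complete when every task has reported, or when the timeout
        has passed (the batch is then flushed with partial data)."""
        now = time.time() if now is None else now
        if len(self.done.get(batch, ())) == self.num_tasks:
            return True
        start = self.started.get(batch)
        return start is not None and now - start > self.timeout
```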
Notes:
- Deploy the Storm machines in the same rack wherever possible, so that intra-cluster bandwidth is not affected;
- Our Nginx logs are rotated hourly. If the cut time is not exact, a visible data fluctuation appears at minute 00. So rotate the logs with an Nginx module where possible; rotating by having crontab send a signal introduces a delay. A 10-second delay in cutting logs is no problem for coarse-grained statistics, but at second-level granularity the fluctuation is obvious;
- If the heap is too small, the worker will be forcibly killed, so configure the -Xmx parameter.
Custom items
- Static resources: a static-resource filtering option that filters specific static resources by Content-Type or suffix.
- Resource merging: URL merging, e.g. for RESTful resources, which display much more cleanly after merging;
- Dimensions and metrics: ANTLR v3 handles the lexing and parsing for custom dimensions and metrics, and the subsequent alarms also support custom expressions.
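The first two custom items can be sketched in a few lines. The suffix and Content-Type lists and the `{id}` placeholder below are illustrative choices, not the production configuration: static filtering checks either the URL suffix or the response Content-Type, and RESTful merging collapses numeric path segments so that `/user/123` and `/user/456` report as one resource.

```python
# Sketch of static-resource filtering (by suffix or Content-Type) and
# RESTful URL merging. Lists and the {id} placeholder are illustrative.
import re

STATIC_SUFFIXES = (".js", ".css", ".png", ".jpg", ".gif", ".ico")
STATIC_TYPES = ("image/", "text/css", "application/javascript")

def is_static(url: str, content_type: str = "") -> bool:
    """Filter specific static resources by suffix or Content-Type."""
    path = url.split("?", 1)[0]
    return path.endswith(STATIC_SUFFIXES) or content_type.startswith(STATIC_TYPES)

_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")

def merge_url(url: str) -> str:
    """Collapse numeric path segments: /user/123/orders -> /user/{id}/orders."""
    path = url.split("?", 1)[0]
    return _NUMERIC_SEGMENT.sub("/{id}", path)
```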