Spider crawling had increased sharply and was putting a heavy load on the server, so in the end nginx's ngx_http_limit_req_module was used to limit how often Baidu Spider can crawl: Baidu Spider is allowed 200 requests per minute, and any excess requests receive a 503 response.
nginx configuration:
#Global configuration (inside the http block)
limit_req_zone $anti_spider zone=anti_spider:60m rate=200r/m;
#Inside a server block
limit_req zone=anti_spider burst=5 nodelay;
if ($http_user_agent ~* "baiduspider") {
    set $anti_spider $http_user_agent;
}
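Putting the pieces together, a minimal sketch of a complete configuration might look like the following; the listen and server_name values are placeholders, not part of the original setup:

http {
    # Shared zone keyed by $anti_spider: 60 MB of state, drained at 200 requests/minute.
    limit_req_zone $anti_spider zone=anti_spider:60m rate=200r/m;

    server {
        listen 80;                # placeholder
        server_name example.com;  # placeholder

        # Only Baidu Spider gets a non-empty key; requests with an empty key are not accounted.
        if ($http_user_agent ~* "baiduspider") {
            set $anti_spider $http_user_agent;
        }

        # Allow a burst of 5 above the rate; reject the rest immediately (503 by default).
        limit_req zone=anti_spider burst=5 nodelay;
    }
}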
Parameter descriptions:
The rate=200r/m in the limit_req_zone directive means that requests in this zone are processed at a rate of at most 200 per minute.
The burst=5 in the limit_req directive allows up to 5 requests above the defined rate before further requests are rejected.
The nodelay in the limit_req directive means that burst requests are served immediately instead of being delayed to fit the rate; once the burst allowance is exhausted, new requests are rejected directly with 503.
The if block checks whether the User-Agent belongs to Baidu Spider and, if so, assigns it to the $anti_spider variable. Because limit_req_zone does not account requests whose key is empty, the variable stays empty for all other clients, and only Baidu Spider is limited.
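As a quick sanity check, a small Python script can hammer the server with a Baidu Spider User-Agent and print the status codes it gets back; the URL below is a placeholder for the host under test:

import urllib.error
import urllib.request

URL = "http://127.0.0.1/"  # placeholder: point this at the nginx server under test
UA = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"

for i in range(10):
    req = urllib.request.Request(URL, headers={"User-Agent": UA})
    try:
        with urllib.request.urlopen(req) as resp:
            print(i, resp.status)   # within the rate/burst allowance: 200
    except urllib.error.HTTPError as e:
        print(i, e.code)            # burst exhausted: 503

With rate=200r/m and burst=5 nodelay, roughly the first six back-to-back requests should succeed and the rest should print 503.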
For detailed parameter descriptions, see the official documentation:
http://nginx.org/en/docs/http/ngx_http_limit_req_module.html#limit_req_zone
This module uses a leaky bucket algorithm to limit requests.
For the leaky bucket algorithm, see http://baike.baidu.com/view/2054741.htm
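To make the idea concrete, here is a minimal leaky bucket sketch in Python; it is a generic illustration of the algorithm under the nodelay behavior described above, not a translation of the nginx code:

import time

class LeakyBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # drain rate (requests per second)
        self.burst = burst             # how much excess the bucket may hold
        self.excess = 0.0              # current bucket level
        self.last = time.monotonic()   # time of the last request

    def allow(self):
        """Return True if the request passes, False if it should get a 503."""
        now = time.monotonic()
        # Leak: the accumulated excess drains at the configured rate over time.
        self.excess = max(0.0, self.excess - (now - self.last) * self.rate)
        self.last = now
        if self.excess > self.burst:
            return False               # bucket full: reject
        self.excess += 1.0             # count this request into the bucket
        return True

Constructing bucket = LeakyBucket(rate_per_sec=200 / 60, burst=5) mirrors the configuration above; calling bucket.allow() in a tight loop lets about six requests through before it starts returning False.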
For the relevant code, see the nginx source file src/http/modules/ngx_http_limit_req_module.c.
The core of the implementation is the ngx_http_limit_req_lookup function.
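As a rough paraphrase of the bookkeeping inside ngx_http_limit_req_lookup (the real function is C operating on a shared-memory red-black tree; the Python names here are invented, and keeping rate and excess scaled by 1000 for integer arithmetic matches my reading of the source but should be checked against it):

from dataclasses import dataclass

@dataclass
class Node:            # hypothetical stand-in for one tracked key's state
    excess: int = 0    # accumulated excess, scaled by 1000
    last_ms: int = 0   # timestamp of the previous request, in milliseconds

def lookup(node, now_ms, rate_scaled, burst_scaled):
    """Decide one request: drain since last time, then add 1000 (one request)."""
    elapsed_ms = now_ms - node.last_ms
    excess = node.excess - rate_scaled * elapsed_ms // 1000 + 1000
    if excess < 0:
        excess = 0
    if excess > burst_scaled:
        return "busy"   # over the burst: nginx answers with 503
    node.excess, node.last_ms = excess, now_ms
    return "ok"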