Why choose Stackless?
Stackless can be thought of as an enhanced version of Python, and its most eye-catching feature is none other than micro-threads. Micro-threads are lightweight threads: compared with ordinary threads, switching between them consumes fewer resources, and sharing data between them is more convenient. The resulting code is also more concise and readable than multi-threaded code. The project is sponsored by the makers of EVE Online, and it is genuinely strong in concurrency and performance. Installation is the same as for Python, so you can even consider replacing the system Python with it. 🙂
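To make the data-sharing point concrete, here is a minimal sketch (not part of the log script) using Stackless channels, which let tasklets pass values to one another without locks:

import stackless

ch = stackless.channel()

def producer():
    for i in range(3):
        ch.send(i)          # blocks until a receiver is ready

def consumer():
    for i in range(3):
        print ch.receive()  # prints 0, 1, 2

stackless.tasklet(producer)()
stackless.tasklet(consumer)()
stackless.run()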
Why choose MongoDB?
On the official website you can see that many popular applications use MongoDB, such as sourceforge, github, and so on. What are its advantages over an RDBMS? The most obvious ones are speed and performance. It can be used not only as a key-value store, but also supports typical database queries (distinct, group, random, index, etc.). Its other notable trait is simplicity: whether it is the server itself, the documentation, or the third-party APIs, you can get productive almost immediately. The one pity is that the stored data files are very large, 2-4 times the size of the raw data. The Apache log tested in this article is 2 GB, and the resulting data files are 6 GB. Sigh... hopefully this shrinks in a newer version; of course, it is also the obvious consequence of trading space for speed.
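As a quick taste of those queries, here is a hedged sketch using pymongo (the field name 'status' assumes the regex group names used later in this article, and the calls assume a pymongo version that provides distinct and create_index):

from pymongo import Connection, ASCENDING

conn = Connection()                  # default host and port
logs = conn.apache.logs              # the collection this article writes to

logs.create_index([('status', ASCENDING)])   # index the status field
print logs.distinct('status')                # all distinct status codes
print logs.find({'status': '404'}).count()   # how many 404s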
In addition to the two pieces of software above, this article also needs the pymongo module: http://api.mongodb.org/python/
pymongo can be installed either from source or via easy_install; the steps are not repeated here.
1. Analyze the information that needs to be saved from the Apache log, such as IP, time, GET/POST, return status code, etc.
import re

# The named groups label each captured field; their exact names do not
# matter much, since fmt_name is derived from the pattern itself.
fmt_str = r'(?P<ip>[.\d]+) - - \[(?P<time>.*?)\] "(?P<method>.*?) (?P<uri>.*?) HTTP/1.\d" (?P<status>\d+) (?P<size>.*?) "(?P<referer>.*?)" "(?P<agent>.*?)"'
fmt_name = re.findall(r'\(\?P<(.*?)>', fmt_str)
fmt_re = re.compile(fmt_str)
This defines a regex that extracts the fields from each log line; fmt_name collects the group names between the angle brackets.
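As a quick sanity check, the pattern applied to a sample combined-format line (the sample itself is illustrative) produces a dict keyed by the group names:

sample = '192.168.0.1 - - [10/Jan/2010:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
m = fmt_re.search(sample)
if m:
    print dict(zip(fmt_name, m.groups()))
    # {'ip': '192.168.0.1', 'status': '200', 'uri': '/index.html', ...}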
2. Define the MongoDB-related variables, including the name of the collection the records are saved to. Connection uses the default host and port.
from pymongo import Connection

conn = Connection()
apache = conn.apache
logs = apache.logs
3. Save the log line
def make_line(line):
    m = fmt_re.search(line)
    if m:
        logs.insert(dict(zip(fmt_name, m.groups())))
4. Read Apache log files
def make_log(log_path):
    with open(log_path) as fp:
        for line in fp:
            make_line(line.strip())
5. Run it.
if __name__ == '__main__':
    make_log('d:/apachelog.txt')
That is the general shape of the script. The Stackless-specific code is not shown above; it looks like the following:
import stackless

def print_x(x):
    print x

stackless.tasklet(print_x)('one')
stackless.tasklet(print_x)('two')
stackless.run()
Calling tasklet() only queues the operation; run() is what actually executes everything in the queue. Here this mechanism replaces the original threading-based approach of analyzing multiple logs in parallel.
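Applied to this script, the idea looks roughly like the sketch below (the file paths are placeholders):

import stackless

log_files = ['d:/access1.log', 'd:/access2.log']   # placeholder paths

for path in log_files:
    stackless.tasklet(make_log)(path)   # queue one tasklet per log file
stackless.run()                         # run all queued tasklets

Note that tasklets are cooperatively scheduled: as written, make_log never yields, so the files are still processed one after another unless stackless.schedule() is called inside the read loop.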
Supplement:
The Apache log is 2 GB, about 6.71 million lines; the generated database files total 6 GB.
Hardware: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz Desktop
System: RHEL 5.2 file system ext3
Other: Stackless 2.6.4, MongoDB 1.2
Up to about 3 million records everything is normal: both CPU and memory are fine, and the insertion speed is very good, around 8,000-9,000 records/second, essentially matching earlier test results on a notebook. Beyond that, memory consumption climbs and the insertion speed slows. At around 5 million records, CPU usage reaches 40% and memory consumption is 2.1 GB. Speed and efficiency seem to pick up again while the second 2 GB data file is being generated. The final result is not entirely satisfying.
Later I retested with 10 million records on a notebook, and the speed improved significantly over the 6.71-million-line run above. Two initial suspects for the difference in performance and speed:
1. File system differences. The notebook runs Ubuntu 9.10 with ext4; from what I have found, ext3 and ext4 differ noticeably when reading and writing large files.
2. Regex matching. Every line goes through a full match-and-extract, so there should be room for optimization on large files (see the sketch below).
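One untested possibility along those lines: for a fixed combined-log format, plain string splitting can stand in for most of the regex work. A rough sketch, not from the original script (field positions assume the standard combined format; error handling is mostly omitted):

def make_line_split(line):
    # In the combined format, double quotes delimit the request,
    # referer, and user-agent fields, so splitting on '"' isolates them.
    parts = line.split('"')
    if len(parts) < 7:
        return
    request = parts[1].split()         # method, uri, protocol
    status_size = parts[2].split()     # status, size
    if len(request) < 3 or len(status_size) < 2:
        return
    logs.insert({
        'ip': parts[0].split()[0],
        'time': parts[0].split('[', 1)[1].rstrip('] '),
        'method': request[0],
        'uri': request[1],
        'status': status_size[0],
        'size': status_size[1],
        'referer': parts[3],
        'agent': parts[5],
    })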