Recently I have been doing statistical analysis of search query logs. Each log entry is parsed and stored in mongodb in a fixed field format, and a scheduled job runs the analysis periodically. One requirement is to count how many times each query was issued in a given period (daily, weekly, or monthly) and to display the most popular queries. Since the amount of data is not large, the solution can be simple, and the method I currently use is mongodb's MapReduce feature. This requirement could also be treated as a group operation; mongodb's group function is based on MapReduce, but group limits the size of the result set. This article introduces mongodb's MapReduce functionality through an example.
Syntax introduction
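MapReduce is a command in mongodb. Its syntax is as follows:
db.runCommand(
    { mapreduce : <collection>,
      map : <mapfunction>,
      reduce : <reducefunction>
      [, query : <query filter object>]
      [, sort : <sort specification for the query>]
      [, limit : <number of objects to read from the collection>]
      [, out : <output collection name>]
      [, keeptemp : <true|false>]
      [, finalize : <finalizefunction>]
      [, scope : <object whose fields go into the JavaScript global scope>]
      [, verbose : true]
    }
);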
I will not explain the three required parameters of this command. The optional parameters are briefly described below:
(1) query: very commonly used. It filters the documents fed to the map stage, limiting the range of records that MapReduce operates on.
(2) sort and limit: these go together with query. At first I thought they applied to the reduce stage, but they are actually applied together with query before the map stage.
(3) keeptemp and out: by default, mongodb creates a temporary collection to store the MapReduce results. The temporary collection is deleted when the client connection is closed or when collection.drop() is called explicitly. In other words, keeptemp defaults to false; if keeptemp is true, the result collection is permanent. The generated collection name is not friendly, however, so you can specify out to name the permanent collection (there is no need to specify keeptemp in that case). When out is specified, the results are not written directly to out: they are first written to a temporary collection, then any existing out collection is dropped, and finally the temporary collection is renamed to out.
(4) finalize: applied to every result once MapReduce has completed; not used very often.
(5) verbose: provides statistics on execution time.
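To make the ordering in (1) and (2) concrete, here is a minimal plain-JavaScript sketch, not MongoDB code, of how query, sort and limit narrow the input before any map call happens. The helper name selectInput, the sample documents, and the simplified equality-only filter are all assumptions made for illustration.

```javascript
// Hypothetical helper: mimics how query, sort and limit pick the
// documents that are handed to map. Only exact-match filters and
// single-field sorts are modelled here.
function selectInput(docs, options) {
  var query = options.query || {};
  var selected = docs.filter(function (doc) {
    return Object.keys(query).every(function (field) {
      return doc[field] === query[field];
    });
  });
  if (options.sort) {
    var field = Object.keys(options.sort)[0];
    var direction = options.sort[field]; // 1 ascending, -1 descending
    selected = selected.slice().sort(function (a, b) {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
      return 0;
    });
  }
  if (typeof options.limit === "number") {
    selected = selected.slice(0, options.limit);
  }
  return selected;
}

// Made-up sample data in the spirit of a query log.
var sampleDocs = [
  { query: "b", cnt: 1, year: 2010, month: 3 },
  { query: "a", cnt: 1, year: 2009, month: 12 },
  { query: "a", cnt: 1, year: 2010, month: 1 },
  { query: "c", cnt: 1, year: 2010, month: 2 }
];

// Filter to 2010, order by month, keep the first two documents.
var mapInput = selectInput(sampleDocs, {
  query: { year: 2010 },
  sort: { month: 1 },
  limit: 2
});

// Only these documents would reach map: the queries "a" and "c".
console.log(mapInput.map(function (d) { return d.query; }));
```

The point of the sketch is only that the 2009 document and the third 2010 document never reach map, so they cannot influence what reduce sees.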
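The format of the execution result is as follows:
{ result : <collection_name>,
  counts : {
      input : <number of objects scanned>,
      emit : <number of times emit was called>,
      output : <number of items in the output collection>
  },
  timeMillis : <job_time>,
  ok : <1_if_ok>
  [, err : <errmsg_if_error>]
}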
A more commonly used shell helper for the MapReduce command is:
db.collection.mapReduce(mapfunction, reducefunction[, options]);
The map function is defined as follows. Inside map, this refers to the document currently being processed, and the function must call emit(key, value) to pass values on to reduce:
function map(void) -> void
The reduce function is defined as follows. Its key is the key passed to emit(key, value), and value_array is the array of values emitted for that key:
function reduce(key, value_array) -> value
Each document in the collection produced by MapReduce has the form { "_id" : <key>, "value" : <reduced value> }.
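The contract between map, emit and reduce can be sketched in plain JavaScript outside of mongodb. Everything below is a hypothetical stand-in: the emit function, the emitted store and the sample documents are assumptions, and the sketch only illustrates the data flow, not how the server is implemented.

```javascript
// Collected values per key; filled by emit() during the map phase.
var emitted = {};

// Stand-in for the emit(key, value) function that is available to map.
function emit(key, value) {
  if (!Object.prototype.hasOwnProperty.call(emitted, key)) {
    emitted[key] = [];
  }
  emitted[key].push(value);
}

// map takes no arguments; the current document is available as `this`.
function map() {
  emit(this.query, this.cnt);
}

// reduce receives one key and the array of values emitted for it.
function reduce(key, valueArray) {
  var sum = 0;
  for (var i = 0; i < valueArray.length; i++) {
    sum += valueArray[i];
  }
  return sum;
}

// Made-up input documents.
var docs = [
  { query: "a", cnt: 1 },
  { query: "b", cnt: 1 },
  { query: "a", cnt: 1 }
];

// Map phase: call map once per document with `this` bound to it.
docs.forEach(function (doc) {
  map.call(doc);
});

// Reduce phase: one output document per distinct key.
var output = Object.keys(emitted).map(function (key) {
  return { _id: key, value: reduce(key, emitted[key]) };
});

console.log(JSON.stringify(output));
// prints [{"_id":"a","value":2},{"_id":"b","value":1}]
```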
Application example
Here is a hypothetical and otherwise meaningless example, meant only to illustrate the use of mongodb MapReduce. Each record has the schema { "query" : <query string>, "cnt" : <count>, "year" : <year>, "month" : <month> }. The cnt field is not strictly needed, because the cnt of every query is 1, but it makes the conditions slightly more complicated. The following MapReduce script can be executed in the mongodb shell.
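// Emit one (query, cnt) pair per log record.
map = function() { emit(this.query, this.cnt); };
// Sum all counts emitted for the same query.
reduce = function(key, vals) {
    var sum = 0;
    for (var i = 0; i < vals.length; i++) sum += vals[i];
    return sum;
};
// Only records from 2010 are passed to map.
res = db.log_info.mapReduce(map, reduce, { "query" : { "year" : 2010 } });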
The execution result is as follows:
{
    "result" : "tmp.mr.mapreduce_1284794393_2",
    "timeMillis" : 72,
    "counts" : {
        "input" : 1000,
        "emit" : 1000,
        "output" : 113
    },
    "ok" : 1
}
The "result" field is the name of the generated temporary collection. The naming rule is: "tmp.mr.mapreduce_" + time(0) + "_" + (jobNumber++)
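As a small illustration of that naming rule, the following sketch builds names the same way in plain JavaScript. The function nextTempCollectionName and the starting value of jobNumber are assumptions; the real counter lives inside the server, and time(0) is represented here by an explicit epoch-seconds argument.

```javascript
// Assumed starting value; the real counter is internal to the server.
var jobNumber = 1;

// Hypothetical helper following the quoted rule:
// "tmp.mr.mapreduce_" + time(0) + "_" + (jobNumber++)
function nextTempCollectionName(epochSeconds) {
  return "tmp.mr.mapreduce_" + epochSeconds + "_" + (jobNumber++);
}

var first = nextTempCollectionName(1284794393);
var second = nextTempCollectionName(1284794393);

console.log(first);  // tmp.mr.mapreduce_1284794393_1
console.log(second); // tmp.mr.mapreduce_1284794393_2
```

Because the counter is post-incremented, two jobs started within the same second still get distinct names.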
Executing db[res.result].find() returns:
{ "_id" : "a", "value" : 521 }
{ "_id" : "aa", "value" : 128 }
{ "_id" : "aaa", "value" : 40 }
{ "_id" : "aaaa", "value" : 4 }
{ "_id" : "aaab", "value" : 9 }
{ "_id" : "aaac", "value" : 13 }
{ "_id" : "aab", "value" : 45 }
{ "_id" : "aaba", "value" : 5 }
{ "_id" : "aabb", "value" : 14 }
{ "_id" : "aabc", "value" : 20 }
{ "_id" : "aac", "value" : 39 }
{ "_id" : "aaca", "value" : 6 }
{ "_id" : "aacb", "value" : 2 }
{ "_id" : "aacc", "value" : 5 }
{ "_id" : "ab", "value" : 65 }
{ "_id" : "aba", "value" : 37 }
{ "_id" : "abaa", "value" : 12 }
{ "_id" : "abab", "value" : 13 }
{ "_id" : "abac", "value" : 10 }
{ "_id" : "abb", "value" : 42 }
Java client API usage
Like the JS shell, the mongodb Java client provides two MapReduce interfaces:
public MapReduceOutput mapReduce( String map , String reduce , String outputCollection , DBObject query );
public MapReduceOutput mapReduce( DBObject command );
MapReduceOutput is implemented as follows:
public class MapReduceOutput {

    MapReduceOutput( DBCollection from , BasicDBObject raw ){
        _collname = raw.getString( "result" );
        _coll = from._db.getCollection( _collname );
        _counts = (BasicDBObject)raw.get( "counts" );
    }

    public DBCursor results(){
        return _coll.find();
    }

    public void drop(){
        _coll.drop();
    }

    public DBCollection getOutputCollection(){
        return _coll;
    }

    final String _collname;
    final DBCollection _coll;
    final BasicDBObject _counts;
}
You can therefore call MapReduceOutput.results() to get a DBCursor for further processing. In my application, for example, I sort by value in descending order and take a limit of 1000 to get the most popular queries.
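The "sort by value descending, then limit" step can be sketched in plain JavaScript on an in-memory copy of the results. The helper topQueries is hypothetical, and the rows are a handful taken from the result listing earlier in this article; with a real cursor the sorting and limiting would be done by the database instead.

```javascript
// Hypothetical helper: order reduced results by value, descending,
// and keep the first n entries. The input array is left untouched.
function topQueries(results, n) {
  return results
    .slice()
    .sort(function (a, b) { return b.value - a.value; })
    .slice(0, n);
}

// A few rows copied from the MapReduce output shown earlier.
var results = [
  { _id: "a", value: 521 },
  { _id: "aa", value: 128 },
  { _id: "aaa", value: 40 },
  { _id: "aab", value: 45 },
  { _id: "ab", value: 65 },
  { _id: "abb", value: 42 }
];

var top3 = topQueries(results, 3);
console.log(top3.map(function (r) { return r._id; }).join(", "));
// prints "a, aa, ab"
```

In the mongodb shell the same selection could be expressed on the output collection with the cursor helpers, for example db[res.result].find().sort({ value : -1 }).limit(1000).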
Due to limitations in the design of the JavaScript engine, mongodb's MapReduce currently runs single-threaded, and the mongodb team is planning to address this. If parallel processing is required, consider sharding or splitting the work up in client code.