GridFS is a distributed file system built on top of MongoDB. It relies on MongoDB's distributed storage mechanisms and keeps both file data and file metadata in MongoDB, so it combines advantages of a document database and a file system. GridFS grew out of the current demand for big data storage and complex data analysis.
Put simply, GridFS implements a file system by storing file data and file metadata in MongoDB. Replication provides failover and data redundancy, and replicas can also be used for read scaling, hot backup, or offline batch processing. Sharding splits the data automatically, which enables big data storage and load balancing. A lightweight file system interface is then built on ordinary database operations: managing and querying the documents in the collections (including with MapReduce), and searching and analyzing them.
The basic idea behind GridFS is that a large file is divided into many chunks, and each chunk is stored as a separate document, so files larger than a single document could hold can still be stored. Because MongoDB supports storing binary data in documents, the storage overhead per chunk is kept small. GridFS relies on MongoDB's replication, sharding, and other mechanisms for distributed file storage, and on MongoDB itself for management and complex analysis.
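The chunking idea can be illustrated without a database. The following is a minimal sketch in plain Python, not the way any driver implements it: the function name split_into_chunks and the tiny chunk size are made up for illustration, and each dict only stands in for one chunk document.

```python
from typing import Iterator


def split_into_chunks(data: bytes, chunk_size: int) -> Iterator[dict]:
    """Yield one chunk document per slice of `data`, numbered from 0."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for n, start in enumerate(range(0, len(data), chunk_size)):
        # Each dict stands in for one document in the chunks collection.
        yield {"n": n, "data": data[start:start + chunk_size]}


chunks = list(split_into_chunks(b"abcdefghij", chunk_size=4))
print([(c["n"], c["data"]) for c in chunks])
# prints [(0, b'abcd'), (1, b'efgh'), (2, b'ij')]
```

Only the last chunk may be shorter than the chunk size; every other chunk is full.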
GridFS uses two collections to store files: one holds the chunks of the file itself, and the other holds the chunk information and metadata of the file. By default these collections are fs.chunks and fs.files respectively.
Chunks collection:
{
"_id": <ObjectId>,
"files_id": <ObjectId>,
"n": <num>,
"data": <binary>
}
Documents in the chunks collection contain the following attributes:
chunks._id: the unique ID of the chunk.
chunks.files_id: the _id of the corresponding document in the files collection.
chunks.n: the sequence number of the chunk, managed by GridFS, starting from 0.
chunks.data: the file data carried by the chunk, stored as the BSON binary type.
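To show how files_id and n work together when a file is read back, here is a small sketch that reassembles a file from chunk documents held in a Python list. The function name reassemble and the sample documents are hypothetical; a real driver would read the chunks from MongoDB instead.

```python
def reassemble(chunk_docs: list, files_id) -> bytes:
    """Concatenate the chunks of one file in ascending order of n."""
    own = [c for c in chunk_docs if c["files_id"] == files_id]
    own.sort(key=lambda c: c["n"])
    # The sequence numbers must be exactly 0, 1, 2, ... with no gaps.
    if [c["n"] for c in own] != list(range(len(own))):
        raise ValueError("missing or duplicate chunk")
    return b"".join(c["data"] for c in own)


# Chunks of two files, deliberately stored out of order.
stored = [
    {"_id": 1, "files_id": "A", "n": 1, "data": b"world"},
    {"_id": 2, "files_id": "B", "n": 0, "data": b"other"},
    {"_id": 3, "files_id": "A", "n": 0, "data": b"hello "},
]
print(reassemble(stored, "A"))  # prints b'hello world'
```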
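The chunks collection uses a compound index on files_id and n, so the chunks of a file can be retrieved efficiently and in order.
Files collection:
{
"_id": <ObjectId>,
"length": <num>,
"chunkSize": <num>,
"uploadDate": <timestamp>,
"md5": <hash>,
"filename": <string>,
"contentType": <string>,
"aliases": <string array>,
"metadata": <any>
}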
Documents in the files collection contain the following attributes; applications may add arbitrary additional attributes:
files._id: the unique identifier of the file. By default MongoDB uses a BSON ObjectId.
files.length: the size of the file in bytes.
files.chunkSize: the size of each chunk. GridFS divides the file into chunks of this size. The default is 255 KB in current MongoDB versions (early releases used 256 KB).
files.uploadDate: the time when GridFS first stored the file, of type ISODate.
files.md5: the MD5 hash of the file, as a string.
files.filename: optional. A human-readable file name.
files.contentType: optional. A valid MIME type for the file.
files.aliases: optional. An array of alias strings.
files.metadata: optional. Custom metadata stored with the file.
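The relationship between length and chunkSize is simple arithmetic. The sketch below computes how many chunks a file occupies and how large the last one is. The helper name chunk_layout is made up, the 256 KB value is used only as an example chunk size, and treating an empty file as having no chunks is an assumption of this sketch.

```python
def chunk_layout(length: int, chunk_size: int) -> tuple:
    """Return (number of chunks, size of the last chunk) for a file."""
    if length == 0:
        # Assumption of this sketch: an empty file has no chunks.
        return 0, 0
    count = (length + chunk_size - 1) // chunk_size  # ceiling division
    last = length - (count - 1) * chunk_size
    return count, last


print(chunk_layout(1_000_000, 256 * 1024))  # prints (4, 213568)
```

A file whose length is an exact multiple of the chunk size ends with a full chunk rather than an extra empty one.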
GridFS can be used through the mongofiles tool or through a MongoDB driver. It provides five main operations:
List: get a list of files
Get: retrieve a file
Put: store a file
Search: search for files by filename
Delete: delete a file
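As a rough illustration of the five operations from a driver, the sketch below runs each of them once. It assumes that fs behaves like PyMongo's gridfs.GridFS object (methods put, list, find, get, delete and exists); the function name exercise_five_operations and the returned dict are invented for this example.

```python
def exercise_five_operations(fs, payload: bytes, filename: str) -> dict:
    """Run Put, List, Search, Get and Delete once against `fs`.

    `fs` is assumed to behave like PyMongo's gridfs.GridFS object.
    """
    file_id = fs.put(payload, filename=filename)                 # Put
    names = fs.list()                                            # List
    matches = [f._id for f in fs.find({"filename": filename})]   # Search
    content = fs.get(file_id).read()                             # Get
    fs.delete(file_id)                                           # Delete
    return {
        "listed": filename in names,
        "found": file_id in matches,
        "intact": content == payload,
        "deleted": not fs.exists(file_id),
    }
```

With PyMongo installed and a MongoDB server reachable, fs could be created as gridfs.GridFS(pymongo.MongoClient().get_database("demo")), where the database name demo is made up. If everything works, every value in the returned dict should be True.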
Because the metadata of GridFS files is stored in the files collection, file management becomes very convenient: files can be queried by file name, upload time, file size, or custom metadata, and MapReduce can be used for complex data analysis. This is one of the many benefits of combining a traditional file system with a database.
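Queries of this kind are ordinary MongoDB filters on the files collection. The sketch below only builds such a filter as a Python dict. The helper name build_files_filter and the metadata field owner are hypothetical; $gte and dot notation for nested fields are standard MongoDB query syntax.

```python
from datetime import datetime, timezone


def build_files_filter(min_length=None, uploaded_since=None, metadata=None) -> dict:
    """Build a MongoDB query filter for documents in the files collection."""
    flt = {}
    if min_length is not None:
        flt["length"] = {"$gte": min_length}
    if uploaded_since is not None:
        flt["uploadDate"] = {"$gte": uploaded_since}
    # Custom metadata fields are addressed with dot notation.
    for key, value in (metadata or {}).items():
        flt[f"metadata.{key}"] = value
    return flt


flt = build_files_filter(
    min_length=1024 * 1024,
    uploaded_since=datetime(2024, 1, 1, tzinfo=timezone.utc),
    metadata={"owner": "alice"},  # hypothetical custom field
)
print(flt)
```

With PyMongo, a filter like this could be passed to db["fs.files"].find(flt), assuming db is a database handle and the default fs prefix is in use.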
Advantages over traditional file systems
Distributed: GridFS is built on MongoDB and can directly use MongoDB's replication and sharding mechanisms, which provide data reliability and horizontal scalability.
MapReduce: complex management, query, and analysis tasks can be run over the stored files and their metadata.
Indexing and caching: metadata is stored in MongoDB, so it is easy to index. Indexes on files and file metadata can improve query efficiency.
Checksum: GridFS generates a hash value for each file, which can be used to verify the file's integrity.
Developer-friendliness: using GridFS can simplify requirements and reduce development costs. If you already use MongoDB, GridFS removes the need for a separate file storage architecture, and code and data are cleanly separated for easier management.
Other: GridFS avoids some of the problems of the file systems typically used to store user-uploaded content. For example, GridFS has no difficulty keeping a large number of files in the same directory. It also avoids the disk fragmentation caused by many small files, because MongoDB allocates data file space in 2 GB blocks.
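The checksum advantage listed above can be sketched with Python's standard hashlib module. This is only an illustration of the idea, assuming the stored hash is the MD5 of the whole file; the function name md5_of_chunks and the 8-byte chunk size are made up.

```python
import hashlib


def md5_of_chunks(chunks) -> str:
    """Hash the chunk payloads in order and return the hex digest."""
    digest = hashlib.md5()
    for data in chunks:
        digest.update(data)
    return digest.hexdigest()


original = b"GridFS stores files as chunks"
stored_md5 = hashlib.md5(original).hexdigest()  # value recorded at upload time
chunks = [original[i:i + 8] for i in range(0, len(original), 8)]

print(md5_of_chunks(chunks) == stored_md5)     # prints True
corrupted = [chunks[0][::-1]] + chunks[1:]     # damage the first chunk
print(md5_of_chunks(corrupted) == stored_md5)  # prints False
```

Hashing chunk by chunk gives the same digest as hashing the whole file at once, so a file can be verified while it is streamed, without loading it into memory in one piece.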