Question

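I need to store 1 to 20 TB of documents. Each document has binary data and metadata. The binary data ranges from a few kilobytes to a few megabytes (text files, images, audio, video). The metadata is a list of key-value pairs. For example:

id: 123
data: blob (a few KB)
fileType: doc
fileName: antibiotics_ar4.doc
path: \datacenter\medicine\antibiotics_ar4.doc
created: 01.01.2014 15:00:30
keywords: [«antibiotics», «medicine», «SomeFirm, Inc»]
field6: ...
fieldN: ...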

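I need to:

Insert data into the storage. I have 1 to 20 insert threads, each writing almost 1 MB per second.
Search the storage without noticeable delay (at most 5-10 seconds). I have 2-3 concurrent users at peak times.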

Users can search for the first 10-1000 documents by:

keywords only (antibiotics + medicine)
keywords + created (medicine + 1 January 2014 to 3 March 2014)
keywords + fileType（medicine + xls）

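For full-text search I would like to use Lucene / Elasticsearch / Solr, but what about searching by date, integer, and string (with a fixed set of options such as red, green, yellow, blue) in a single query? The server side will be written in Java.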

How can I do this in MongoDB? What should I do: create an index on every field, or something else?

Answer 1

Use GridFS
Use the metadata "feature" of GridFS
Create a text index on the metadata fields you want to search
Done, without Lucene
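
As a rough illustration of the second step, here is a sketch of what a GridFS `fs.files` document might look like for the example from the question. GridFS stores one such document per file and splits the binary payload into `fs.chunks` documents. The helpers `makeFilesDocument` and `chunkCount` are hypothetical, the field names inside `metadata` are taken from the question rather than prescribed by MongoDB, and the 3 MiB file size is made up. It is plain JavaScript and needs no database connection.

```javascript
// Sketch only: mirrors the layout of a GridFS fs.files document.
// makeFilesDocument and chunkCount are hypothetical helpers, not driver APIs.
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 255 KiB, the default in current drivers

function makeFilesDocument(filename, lengthBytes, metadata, chunkSize = DEFAULT_CHUNK_SIZE) {
  return {
    filename: filename,
    length: lengthBytes,     // total size of the binary payload in bytes
    chunkSize: chunkSize,    // size of each fs.chunks piece in bytes
    uploadDate: new Date(),  // normally set by the driver at upload time
    metadata: metadata       // free-form, application-defined key-value pairs
  };
}

// Number of fs.chunks documents needed to hold the payload.
function chunkCount(filesDoc) {
  return Math.ceil(filesDoc.length / filesDoc.chunkSize);
}

// Example values taken from the question; the 3 MiB size is an assumption.
const exampleDoc = makeFilesDocument("antibiotics_ar4.doc", 3 * 1024 * 1024, {
  fileType: "doc",
  path: "\\datacenter\\medicine\\antibiotics_ar4.doc",
  created: new Date("2014-01-01T15:00:30Z"), // time zone is an assumption
  keywords: ["antibiotics", "medicine", "SomeFirm, Inc"]
});

console.log(chunkCount(exampleDoc));
```

If the searchable fields live under `metadata` like this, the text index would be declared on the dotted paths (for example `"metadata.keywords"`) rather than on top-level fields.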

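Edit:

In the comments, it was asked whether a text search can span multiple fields. While there can be only one text index per collection, that index may include multiple fields.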

// Update a previously saved file with tags
// Import skipped for readability
db.fs.files.update({filename:"VeryImportantDocument.docx"},{$set:{tags:["foo","bar","baz"]}})

/* Create the index
   As you can see, there are multiple fields which together will
   provide the entries in the index 
*/
// createIndex supersedes ensureIndex, which is deprecated since MongoDB 3.0
db.fs.files.createIndex({"tags":"text",filename:"text"});

db.fs.files.find({$text:{$search:"FOO"}})
> {"_id":... //Abbreviated result. Tags can be searched with that index

// search for filetypes is also possible (via extension, not mime-type, of course)
db.fs.files.find({$text:{$search:"docx"}})
> {"_id":...

db.fs.files.find({$text:{$search:"very"}})
> {} // No result: searches for parts of a word have to be done via the $regex operator
db.fs.files.find({"filename":{$regex:"VeRy",$options:"i"}})
> {"_id":...

As for dates: if you have a uniform format that you can rely on (because of parsing or whatever), you can add another part to the query:

db.fs.files.find({"uploadDate":{$gte:startOfTimeFrame,$lte:endOfTimeFrame}})

This can of course be combined with the other queries using the usual logical operators.
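
To make the combination concrete, here is a minimal sketch of how the three search modes from the question (keywords, keywords plus date range, keywords plus file type) could be folded into one filter document. The helper `buildSearchFilter` and the field name `metadata.fileType` are assumptions for illustration, not part of the original answer. The code only builds the filter object, so it runs without a database.

```javascript
// Hypothetical helper: builds one MongoDB filter document from optional
// search criteria. Top-level keys of a filter are implicitly AND-ed.
function buildSearchFilter({ keywords = [], from = null, to = null, fileType = null } = {}) {
  const filter = {};
  if (keywords.length > 0) {
    // $text treats space-separated terms as a logical OR.
    filter.$text = { $search: keywords.join(" ") };
  }
  if (from !== null || to !== null) {
    const range = {};
    if (from !== null) range.$gte = from;
    if (to !== null) range.$lte = to;
    filter.uploadDate = range;
  }
  if (fileType !== null) {
    // Assumed field name; the original answer matches the extension via text search instead.
    filter["metadata.fileType"] = fileType;
  }
  return filter;
}

// "keywords + created + fileType" from the question; date boundaries are assumptions.
const filter = buildSearchFilter({
  keywords: ["medicine"],
  from: new Date("2014-01-01T00:00:00Z"),
  to: new Date("2014-03-03T23:59:59Z"),
  fileType: "xls"
});
console.log(JSON.stringify(filter));
```

In the mongo shell the result could then be passed on as `db.fs.files.find(filter).limit(1000)` to cap the result at the first 1000 documents.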
