MongoDB / Java: Finding unique documents by value

Time: 2013-09-12 20:07:15

Tags: java mongodb mapreduce grouping distinct

I have a large number of timestamped documents in a MongoDB database. Each document has a unique identifier.

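Using the sample documents below, I want to first sort the collection by "updateDate" and then retrieve a list of "uniqueIdentifier" values, one for each distinct "domainName" (the most recently updated document per domain).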

{
  "domainName": "www.example-domain-0.com",
  "updateDate": {
    "$date": "2013-09-10T19:20:56.652Z"
  },
  "uniqueIdentifier": "375d7219-828c-4f81-a1fc-3692aa68d110"
}

{
  "domainName": "www.example-domain-1.com",
  "updateDate": {
    "$date": "2013-09-12T19:44:56.833Z"
  },
  "uniqueIdentifier": "f96bb647-5dcb-4cc1-8a66-105177a45474"
}

{
  "domainName": "www.example-domain-0.com",
  "updateDate": {
    "$date": "2013-09-12T19:10:56.833Z"
  },
  "uniqueIdentifier": "14f6yu43-20eb-42c6-bb06-26b77c0bf0cb"
}

{
  "domainName": "www.example-domain-2.com",
  "updateDate": {
    "$date": "2013-09-12T19:39:56.833Z"
  },
  "uniqueIdentifier": "b2a6ae10-20eb-42c6-bb06-26b77c0bf0cb"
}

For the collection above, I would like to get the following ordered result set:

"f96bb647-5dcb-4cc1-8a66-105177a45474",
"b2a6ae10-20eb-42c6-bb06-26b77c0bf0cb",
"14f6yu43-20eb-42c6-bb06-26b77c0bf0cb"

Note that "375d7219-828c-4f81-a1fc-3692aa68d110" is not returned, because two documents contain the following value and it belongs to the older of the two:

"domainName": "www.example-domain-0.com".

What is the fastest way to achieve this in Java? If the answer is a map-reduce function, can anyone help me understand how to write it in Java?

Currently I am using the following in Java, but it is very inefficient for large collections:

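    // LinkedHashMap keeps insertion order, so the newest-first order is preserved.
    Map<String, String> domainMap = new LinkedHashMap<String, String>();
    // Fetch only the two fields needed on the client.
    BasicDBObject restrict = new BasicDBObject("uniqueIdentifier", 1)
            .append("domainName", 1);
    DBCursor cur = domainCollection.find(null, restrict).sort(
            new BasicDBObject("updateDate", -1));
    while (cur.hasNext()) {
        // Call next() only once per iteration; a second call skips a document.
        DBObject doc = cur.next();
        String id = doc.get("uniqueIdentifier").toString();
        String domain = doc.get("domainName").toString();
        // The first document seen for a domain is its most recent one.
        if (!domainMap.containsKey(domain)) {
            domainMap.put(domain, id);
        }
    }
    cur.close();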
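
For readers who want to check the intended logic without a database, here is a minimal sketch of the same "newest document per domain" idea using only the Java standard library. The `LatestPerDomain` class, the `Doc` type, and `sampleDocs()` are hypothetical names introduced for illustration; the data is copied from the sample documents above.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the collection in the question.
final class LatestPerDomain {

    static final class Doc {
        final String domainName;
        final Instant updateDate;
        final String uniqueIdentifier;

        Doc(String domainName, String updateDate, String uniqueIdentifier) {
            this.domainName = domainName;
            this.updateDate = Instant.parse(updateDate);
            this.uniqueIdentifier = uniqueIdentifier;
        }
    }

    // Returns one uniqueIdentifier per domainName (the most recently updated),
    // ordered from newest to oldest updateDate.
    static List<String> latestIds(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparing((Doc d) -> d.updateDate).reversed());

        // LinkedHashMap keeps insertion order, so the newest-first order survives.
        Map<String, String> byDomain = new LinkedHashMap<>();
        for (Doc d : sorted) {
            // Only the first (newest) document seen for a domain is kept.
            byDomain.putIfAbsent(d.domainName, d.uniqueIdentifier);
        }
        return new ArrayList<>(byDomain.values());
    }

    // The four sample documents from the question.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("www.example-domain-0.com", "2013-09-10T19:20:56.652Z",
                    "375d7219-828c-4f81-a1fc-3692aa68d110"),
            new Doc("www.example-domain-1.com", "2013-09-12T19:44:56.833Z",
                    "f96bb647-5dcb-4cc1-8a66-105177a45474"),
            new Doc("www.example-domain-0.com", "2013-09-12T19:10:56.833Z",
                    "14f6yu43-20eb-42c6-bb06-26b77c0bf0cb"),
            new Doc("www.example-domain-2.com", "2013-09-12T19:39:56.833Z",
                    "b2a6ae10-20eb-42c6-bb06-26b77c0bf0cb"));
    }

    public static void main(String[] args) {
        // Prints the three identifiers in the order requested in the question.
        System.out.println(latestIds(sampleDocs()));
    }
}
```

This does all the work on the client, so it shares the scaling problem described above; it is only meant to pin down the expected result.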

1 answer:

Answer 0 (score: 2)

Try the aggregation framework:

> db.foodle.find()
{ "_id" : ObjectId("52323c61fd99d220e24eef53"), "domainName" : "www.example-domain-0.com", "updateDate" : ISODate("2013-09-12T22:12:49.933Z"), "uniqueIdentifier" : "375d7219-828c-4f81-a1fc-3692aa68d110" }
{ "_id" : ObjectId("52323c64fd99d220e24eef54"), "domainName" : "www.example-domain-1.com", "updateDate" : ISODate("2013-09-12T22:12:52.877Z"), "uniqueIdentifier" : "f96bb647-5dcb-4cc1-8a66-105177a45474" }
{ "_id" : ObjectId("52323c67fd99d220e24eef55"), "domainName" : "www.example-domain-0.com", "updateDate" : ISODate("2013-09-12T22:12:55.550Z"), "uniqueIdentifier" : "14f6yu43-20eb-42c6-bb06-26b77c0bf0cb" }
{ "_id" : ObjectId("52323c6afd99d220e24eef56"), "domainName" : "www.example-domain-2.com", "updateDate" : ISODate("2013-09-12T22:12:58.390Z"), "uniqueIdentifier" : "b2a6ae10-20eb-42c6-bb06-26b77c0bf0cb" }

> db.foodle.aggregate(
... { $sort: { domainName:1, uniqueIdentifier:1 }},
... { $group:{ _id:'$domainName', uniqueIdentifier:{$first:'$uniqueIdentifier'}, thecount:{$sum:1}}},
... { $project:{ _id:0, uniqueIdentifier:1}},
... { $sort: { uniqueIdentifier:1 }}
... )
{
        "result" : [
                {
                        "uniqueIdentifier" : "14f6yu43-20eb-42c6-bb06-26b77c0bf0cb"
                },
                {
                        "uniqueIdentifier" : "b2a6ae10-20eb-42c6-bb06-26b77c0bf0cb"
                },
                {
                        "uniqueIdentifier" : "f96bb647-5dcb-4cc1-8a66-105177a45474"
                }
        ],
        "ok" : 1
}

My Java is limited, but I think it would look something like this:

DB db = mongoClient.getDB("test");

DBCollection testCollection = db.getCollection("foodle");

DBObject primarySortFields = new BasicDBObject("domainName", 1);
primarySortFields.put("uniqueIdentifier", 1);
DBObject firstSort = new BasicDBObject("$sort", primarySortFields);

DBObject groupFields = new BasicDBObject("_id", "$domainName");
groupFields.put("uniqueIdentifier", new BasicDBObject("$first","$uniqueIdentifier"));
groupFields.put("thecount", new BasicDBObject("$sum", 1));
DBObject group = new BasicDBObject("$group", groupFields);

DBObject secondSort = new BasicDBObject("$sort", new BasicDBObject("uniqueIdentifier",1));

DBObject fields = new BasicDBObject("_id", 0);
fields.put("uniqueIdentifier", 1);
DBObject project = new BasicDBObject("$project", fields);

AggregationOutput output = testCollection.aggregate(firstSort, group, project, secondSort);

System.out.println(output);

{ "serverUsed" : "/127.0.0.1:27017" , "result" : [ { "uniqueIdentifier" : "14f6yu43-20eb-42c6-bb06-26b77c0bf0cb"} , { "uniqueIdentifier" : "b2a6ae10-20eb-42c6-bb06-26b77c0bf0cb"} , { "uniqueIdentifier" : "f96bb647-5dcb-4cc1-8a66-105177a45474"}] , "ok" : 1.0}
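
One caveat about the pipeline above: it sorts on domainName and uniqueIdentifier rather than updateDate, so the identifier kept for each domain is the alphabetically first one, and the final order is alphabetical instead of newest first. A possible variant that follows the question more closely would sort by updateDate descending, take `$first` per domain, and sort by updateDate again. The sketch below builds those stages as plain `java.util` maps so that it does not depend on a particular driver version; the `LatestPerDomainPipeline` class and the `doc` helper are hypothetical names, and wrapping each stage for the driver (for example with `new BasicDBObject(map)` in the legacy driver) is an assumption to verify against the driver you use.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, driver-independent description of the aggregation stages.
final class LatestPerDomainPipeline {

    // Builds an insertion-ordered map from alternating key/value arguments.
    static Map<String, Object> doc(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }

    static List<Map<String, Object>> pipeline() {
        return Arrays.asList(
            // 1. Newest documents first.
            doc("$sort", doc("updateDate", -1)),
            // 2. Per domain, keep the fields of the first (newest) document.
            doc("$group", doc(
                "_id", "$domainName",
                "uniqueIdentifier", doc("$first", "$uniqueIdentifier"),
                "updateDate", doc("$first", "$updateDate"))),
            // 3. Order the surviving documents newest first.
            doc("$sort", doc("updateDate", -1)),
            // 4. Return only the identifier.
            doc("$project", doc("_id", 0, "uniqueIdentifier", 1)));
    }

    public static void main(String[] args) {
        // Print each stage so it can be compared with the shell syntax.
        for (Map<String, Object> stage : pipeline()) {
            System.out.println(stage);
        }
    }
}
```

An index on updateDate may help the first `$sort` stage on a large collection, though whether it is used depends on the server version and the data.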