Question

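Friends!

I use MongoDB in a Java project via spring-data. I use a Repository interface to access the data in a collection. For some processing I need to iterate over all elements of a collection. I can use the repository's fetchAll method, but it always returns an ArrayList.

However, suppose one of the collections is large: at least 1 million records of several kilobytes each. I think I should not use fetchAll in that case, but I could not find a convenient method that returns some kind of iterator (which might allow the collection to be fetched partially), nor a convenient method that takes a callback.

I have only seen support for retrieving such a collection in pages. Is that the only way to work with such collections?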

Answer 1

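A late response, but it may help someone in the future. Spring Data does not provide any API that wraps the MongoDB cursor functionality. It uses a cursor inside its find methods, but always returns the complete list of objects. The options are to use the Mongo API directly or to use the Spring Data paging API, something like this: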

        // Written against an older Spring Data API; newer versions use
        // PageRequest.of(pageNumber, pageLimit) and page.hasNext() instead.
        final int pageLimit = 300;
        int pageNumber = 0;
        Page<T> page = repository.findAll(new PageRequest(pageNumber, pageLimit));
        while (page.hasNextPage()) {
            processPageContent(page.getContent());
            page = repository.findAll(new PageRequest(++pageNumber, pageLimit));
        }
        // process the last page
        processPageContent(page.getContent());
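
Since the question asks for an iterator that fetches the collection piece by piece, the paging loop above can also be hidden behind a plain `Iterator`. The following is only a minimal sketch in plain Java: `PagedIterator` and its `pageFetcher` parameter are hypothetical names, not part of Spring Data, and the sketch assumes that an empty page marks the end of the data.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/**
 * Hypothetical helper (not part of Spring Data): exposes a page-by-page
 * source as a plain Iterator, so only one page is held in memory at a time.
 */
final class PagedIterator<T> implements Iterator<T> {
    private final IntFunction<List<T>> pageFetcher; // page number -> page content
    private Iterator<T> current = Collections.emptyIterator();
    private int nextPage = 0;
    private boolean exhausted = false;

    PagedIterator(IntFunction<List<T>> pageFetcher) {
        this.pageFetcher = pageFetcher;
    }

    @Override
    public boolean hasNext() {
        // Fetch the next page lazily, only once the current one is used up.
        while (!current.hasNext() && !exhausted) {
            List<T> content = pageFetcher.apply(nextPage++);
            if (content.isEmpty()) {
                exhausted = true; // assumption: an empty page marks the end
            } else {
                current = content.iterator();
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

With a recent Spring Data version the fetcher could be something like `n -> repository.findAll(PageRequest.of(n, 300)).getContent()`. Keep in mind that offset-based paging may skip or repeat elements if the collection changes while you iterate, and that deep pages tend to get slower because the server still has to walk past the skipped documents.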

Answer 2

Using MongoTemplate::stream() is probably the most appropriate Java wrapper for a DBCursor.

Answer 3

You can still use mongoTemplate to access the collection and simply use a DBCursor:

    DBCollection collection = mongoTemplate.getCollection("boundary");
    DBCursor cursor = collection.find();
    while (cursor.hasNext()) {
        DBObject obj = cursor.next();
        Object object = obj.get("polygons");
        // ...
    }

Answer 4

Another way:

    do {
        page = repository.findAll(new PageRequest(pageNumber, pageLimit));
        // process page.getContent() here
        pageNumber++;
    } while (!page.isLastPage());

Answer 5

Check out the new method that handles the results document by document.

http://docs.spring.io/spring-data/mongodb/docs/current/api/org/springframework/data/mongodb/core/MongoTemplate.html#executeQuery-org.springframework.data.mongodb.core.query.Query-java.lang.String-org.springframework.data.mongodb.core.DocumentCallbackHandler-

Answer 6

Streams as a cursor:

    @Query("{}")
    Stream<Alarm> findAllByCustomQueryAndStream();

So for large amounts of data you can stream the results and process them one by one without running into memory limits.
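
One thing worth adding: a stream returned this way is backed by an open cursor, and the Spring Data documentation advises closing it after use, for example with try-with-resources. The sketch below shows that pattern in plain Java only. `findAllAsStream` is a hypothetical stand-in for the repository method, strings stand in for `Alarm`, and the `cursorClosed` flag merely simulates the cursor being released.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class StreamCursorSketch {
    // Stand-in for the database cursor state; flipped when the stream is closed.
    static final AtomicBoolean cursorClosed = new AtomicBoolean(false);

    // Hypothetical stand-in for a repository method returning Stream<Alarm>.
    static Stream<String> findAllAsStream() {
        cursorClosed.set(false);
        return Stream.of("alarm-1", "alarm-2", "alarm-3")
                .onClose(() -> cursorClosed.set(true));
    }

    // Processes elements one at a time and always releases the "cursor".
    static int processAll() {
        AtomicInteger processed = new AtomicInteger();
        try (Stream<String> alarms = findAllAsStream()) {
            alarms.forEach(alarm -> processed.incrementAndGet());
        }
        return processed.get();
    }

    public static void main(String[] args) {
        // prints "3 processed, closed=true"
        System.out.println(processAll() + " processed, closed=" + cursorClosed.get());
    }
}
```

The try-with-resources block runs the `onClose` handler even if processing throws, which is what you want when the handler stands for releasing a server-side cursor.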

Answer 7

You might want to try the DBCursor approach, like this:

    DBObject query = new BasicDBObject(); //setup the query criteria
    query.put("method", method);
    query.put("ctime", (new BasicDBObject("$gte", bTime)).append("$lt", eTime));

    logger.debug("query: {}", query);

    DBObject fields = new BasicDBObject(); //only get the needed fields.
    fields.put("_id", 0);
    fields.put("uId", 1);
    fields.put("ctime", 1);

    DBCursor dbCursor = mongoTemplate.getCollection("collectionName").find(query, fields);

    while (dbCursor.hasNext()){
        DBObject object = dbCursor.next();
        logger.debug("object: {}", object);
        //do something.
    }

Answer 8

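The best way to iterate over a large collection is to use the Mongo API directly. I used the code below, and it worked like a charm for my use case.
I had to iterate over more than 15 million records, and some of those documents were large.
The following code is from a Kotlin Spring Boot app (Spring Boot version 2.4.5):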

    fun getAbcCursor(batchSize: Int, from: Long?, to: Long?): MongoCursor<Document> {

        val collection = xyzMongoTemplate.getCollection("abc")
        val query = Document("field1", "value1")
        if (from != null) {
            val fromDate = Date(from)
            val toDate = if (to != null) { Date(to) } else { Date() }
            query.append(
                "createTime",
                Document(
                    "\$gte", fromDate
                ).append(
                    "\$lte", toDate
                )
            )
        }
        return collection.find(query).batchSize(batchSize).iterator()
    }

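Then, from a service-layer method, you can keep calling MongoCursor.next() on the returned cursor for as long as MongoCursor.hasNext() returns true.

Important observation: do not forget to set batchSize on the FindIterable (the return type of MongoCollection.find()). If you do not provide a batch size, the cursor fetches the initial 101 records and, in my case, appeared to hang after that, because it then tries to fetch the remaining records in one very large batch.
In my scenario I used a batch size of 2000, because it gave the best results during testing. The optimal batch size depends on the average size of your records.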

Here is the equivalent code in Java (createTime is removed from the query, since it is specific to my data model).

    MongoCursor<Document> getAbcCursor(int batchSize) {
        MongoCollection<Document> collection = xyzMongoTemplate.getCollection("your_collection_name");
        Document query = new Document("field1", "value1"); // query --> {"field1": "value1"}
        // batchSize limits how many documents each round trip to the server returns
        return collection.find(query).batchSize(batchSize).iterator();
    }
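
To make the batch size trade-off mentioned above more concrete, here is a toy model of a batched cursor in plain Java. It is not the MongoDB driver's implementation; `BatchedCursorSketch` and `roundTripsFor` are hypothetical names, and integers stand in for documents. It only illustrates that the cursor holds at most one batch in memory and that the number of round trips is the record count divided by the batch size, rounded up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Toy model of a batched cursor; not the MongoDB driver's implementation. */
final class BatchedCursorSketch implements Iterator<Integer> {
    private final int total;      // number of "documents" on the server
    private final int batchSize;  // documents returned per round trip
    private int fetched = 0;      // documents pulled from the server so far
    private int roundTrips = 0;   // simulated network round trips
    private Iterator<Integer> batch = Collections.emptyIterator();

    BatchedCursorSketch(int total, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.total = total;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        if (!batch.hasNext() && fetched < total) {
            // Simulate one round trip that returns at most batchSize documents.
            int end = Math.min(fetched + batchSize, total);
            List<Integer> next = new ArrayList<>(end - fetched);
            for (int i = fetched; i < end; i++) {
                next.add(i);
            }
            fetched = end;
            roundTrips++;
            batch = next.iterator();
        }
        return batch.hasNext();
    }

    @Override
    public Integer next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return batch.next();
    }

    int roundTrips() {
        return roundTrips;
    }

    // Drains a cursor over `total` documents and reports the round trips used.
    static int roundTripsFor(int total, int batchSize) {
        BatchedCursorSketch cursor = new BatchedCursorSketch(total, batchSize);
        while (cursor.hasNext()) {
            cursor.next();
        }
        return cursor.roundTrips();
    }
}
```

In this model, 10,000 records with a batch size of 2000 take 5 round trips, while a batch size of 100 takes 100. A larger batch means fewer round trips but more memory per batch, which is why the best value depends on the average record size.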

Iterating over large collections in MongoDB via spring-data
