Finding the last event for each entity in Lucene

Time: 2016-11-11 15:55:03

Tags: java lucene

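I am storing events (documents) in a Lucene document store (version 6.2.1). Each document has an EntityId and a Timestamp.

There can be many documents with the same EntityId.

I want to retrieve, for each EntityId, the document with the latest Timestamp.

Do I have to go through all the events and do this in Java? I have looked at faceting, but as far as I can tell it is only for counting, not for max/min-style aggregation.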
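For reference, the do-it-all-in-Java fallback described above could look like the following sketch. It assumes the events have already been read out of the index into memory; `LatestEventPerEntity`, `Event` and `latestPerEntity` are hypothetical names used only for illustration and are not part of the Lucene API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatestEventPerEntity {

    // Hypothetical in-memory stand-in for one indexed event; not a Lucene type.
    static final class Event {
        final String entityId;
        final long timestamp;

        Event(String entityId, long timestamp) {
            this.entityId = entityId;
            this.timestamp = timestamp;
        }
    }

    // Single pass over all events, keeping for each EntityId the event
    // with the largest timestamp seen so far.
    static Map<String, Event> latestPerEntity(Iterable<Event> events) {
        Map<String, Event> latest = new HashMap<>();
        for (Event e : events) {
            // merge() stores e if the key is new; otherwise it keeps the
            // existing event unless e is strictly newer.
            latest.merge(e.entityId, e,
                (old, cur) -> cur.timestamp > old.timestamp ? cur : old);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a", 10L),
            new Event("b", 5L),
            new Event("a", 30L),
            new Event("b", 2L));
        for (Map.Entry<String, Event> entry : latestPerEntity(events).entrySet()) {
            System.out.printf("EntityId = %s Timestamp = %d%n",
                entry.getKey(), entry.getValue().timestamp);
        }
    }
}
```

This touches every event once and holds one entry per entity in memory, which is the cost the question hopes to avoid by pushing the aggregation into the index.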

2 Answers:

Answer 0: (score: 1)

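What you are trying to do can be done with GroupingSearch, which is provided in the lucene-grouping artifact.

GroupingSearch groups your documents by the provided group field (EntityId in our case). That field must be indexed with sorted doc values (for example as a SortedDocValuesField), otherwise you will get an error like the following: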

  

java.lang.IllegalStateException: unexpected docvalues type NONE for field '${field-name}' (expected=SORTED).

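Then, to be able to get the latest document for a given EntityId, the Timestamp field also needs to be sortable, that is, indexed with doc values.

For example, if I index my documents as follows: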

String id = ..
long timestamp = ...
Document doc = new Document();
// The sorted version of my EntityId
doc.add(new SortedDocValuesField("EntityId", new BytesRef(id)));
// The stored version of my EntityId to be able to get its value later if needed
doc.add(new StringField("Id", id, Field.Store.YES));
// The sorted version of my timestamp
doc.add(new NumericDocValuesField("Timestamp", timestamp));
// The stored version of my timestamp to be able to get its value later if needed
doc.add(new StringField("Tsp", Long.toString(timestamp), Field.Store.YES));

Then I can get the latest document for each EntityId as follows:

IndexSearcher searcher = ...
// Some random query here I get all docs
Query query = new MatchAllDocsQuery();
// Group the docs by EntityId
GroupingSearch groupingSearch = new GroupingSearch("EntityId");
// Sort the docs of the same group by Timestamp in reversed order to get
// the most recent first
groupingSearch.setSortWithinGroup(
    new Sort(new SortField("Timestamp", SortField.Type.LONG, true))
);
// Set the limit of docs for a given group to 1 as we only want the latest
// NB: This is the default value so it is not required
groupingSearch.setGroupDocsLimit(1);
// Get the 10 first matching groups
TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 10);
// Iterate over the groups found
for (GroupDocs<BytesRef> groupDocs : result.groups) {
    // Iterate over the docs of a given group
    for (ScoreDoc scoreDoc : groupDocs.scoreDocs) {
        // Get the related doc
        Document doc = searcher.doc(scoreDoc.doc);
        // Print the stored value of EntityId and Timestamp
        System.out.printf(
            "EntityId = %s Timestamp = %s%n", doc.get("Id"),  doc.get("Tsp")
        );
    }
}

More details about grouping.

Answer 1: (score: 0)

You could try something like the Collapsing query parser from Solr (untested).

Or you can achieve the same thing with Grouping.