(Pure) Lucene: counting documents with a LongField timestamp, grouped by year

Date: 2015-11-05 12:17:27

Tags: lucene range facet faceted-search

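My document structure is: [text: TextField, date: LongField]

I am looking for a 'statistics' query over my documents, based on a precision level of the dateTime field. That means counting documents grouped by the LongField date while ignoring some of the low-order digits of the date.

For a given precision, I want to know how many documents match each distinct value at that precision.

Say the precision 'year' groups by "date / 10000". With the following data: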

{text:"text1",dateTime:(some timestamp where year is 2015 like 20150000)}
{text:"text2",dateTime:(some timestamp where year is 2010 like 20109878)}
{text:"text3",dateTime:(some timestamp where year is 2015 like 20150024)}
{text:"text14",dateTime:(some timestamp where year is 1997 like 19970987)}

The result should be:

[{bracket:1997, count:1}
{bracket:2010, count:1}
{bracket:2015, count:2}]
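
To make the expected arithmetic concrete, here is a minimal plain-Java sketch of the bucketing itself, independent of Lucene. The class and method names (`BucketCounts`, `countByBucket`) are hypothetical, and the input is simply the four sample values above held in an array; it only illustrates what "group by date / 10000" means, not how Lucene would compute it.

```java
import java.util.Map;
import java.util.TreeMap;

final class BucketCounts {

    /** Counts values per bucket, where the bucket of a value is value / divisor (integer division). */
    static Map<Long, Long> countByBucket(long[] dates, long divisor) {
        // TreeMap keeps the buckets sorted; buckets with no values never get an entry.
        Map<Long, Long> counts = new TreeMap<>();
        for (long date : dates) {
            counts.merge(date / divisor, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample timestamps from the question.
        long[] dates = {20150000L, 20109878L, 20150024L, 19970987L};
        // 'year' precision: drop the four low-order digits.
        System.out.println(countByBucket(dates, 10_000L)); // prints {1997=1, 2010=1, 2015=2}
    }
}
```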

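NumericRangeQuery lets me create one (or a few) ranges, but can Lucene generate the ranges itself from a precision step?

I could handle this by creating a new field for each precision level I need, but maybe something like this already exists.

This is a kind of faceted search where the facet is time. The use cases would be: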

- give me the document count for each millennium,
- then give me the document count for each century (inside a millennium),
- then give me the document count for each year (inside a century),
- then give me the document count for each day (inside a year).

When a bucket contains 0 documents, it should not appear in the result.
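
The drill-down described above can be sketched in plain Java as follows. This is an illustration of the intended behavior only, not a Lucene feature: it assumes the timestamp is a decimal value laid out like yyyyMMdd (as the sample values suggest), and the names `TimeDrillDown`, `count`, `drillDown` and the divisor constants are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class TimeDrillDown {

    // Divisors for a yyyyMMdd-style long (an assumption based on the sample values).
    static final long MILLENNIUM = 10_000_000L; // 20150024 -> 2
    static final long CENTURY = 1_000_000L;     // 20150024 -> 20
    static final long YEAR = 10_000L;           // 20150024 -> 2015
    static final long DAY = 1L;                 // 20150024 -> 20150024

    /** Counts all dates per bucket at the given precision. */
    static Map<Long, Long> count(long[] dates, long divisor) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long date : dates) {
            counts.merge(date / divisor, 1L, Long::sum);
        }
        return counts;
    }

    /** Counts dates per child bucket, restricted to dates that fall inside one parent bucket. */
    static Map<Long, Long> drillDown(long[] dates, long parentDivisor, long parentBucket, long childDivisor) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long date : dates) {
            if (date / parentDivisor == parentBucket) {
                // Only dates that exist create an entry, so empty buckets are never reported.
                counts.merge(date / childDivisor, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] dates = {20150000L, 20109878L, 20150024L, 19970987L};
        System.out.println(count(dates, MILLENNIUM));                    // prints {1=1, 2=3}
        System.out.println(drillDown(dates, MILLENNIUM, 2L, CENTURY));   // prints {20=3}
        System.out.println(drillDown(dates, CENTURY, 20L, YEAR));        // prints {2010=1, 2015=2}
        System.out.println(drillDown(dates, YEAR, 2015L, DAY));          // prints {20150000=1, 20150024=1}
    }
}
```

Each level is the same integer division with a different divisor, and restricting to a parent bucket is a simple equality check on the coarser division.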

Regards

1 answer:

Answer 0: (score: 0)

A Collector can do this without any tricks. Here is working code:

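import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.SimpleCollector;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Before;
import org.junit.Test;

// Imports assume the Lucene 5.x API (LongField, LeafReaderContext, needsScores()) and JUnit 4.
public class GroupByTest1 {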
    private RAMDirectory directory;
    private IndexSearcher searcher;
    private IndexReader reader;
    private Analyzer analyzer;

    private class Data {
        String text;
        Long dateTime;

        private Data(String text, Long dateTime) {
            this.text = text;
            this.dateTime = dateTime;
        }
    }

    @Before
    public void setUp() throws Exception {
        directory = new RAMDirectory();

        analyzer = new WhitespaceAnalyzer();
        IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));
        Data datas[] = {
                new Data("A", 2012L),
                new Data("B", 2012L),
                new Data("C", 2012L),
                new Data("D", 2013L),
        };

        for (Data data : datas) {
            // Build a fresh Document per record rather than clearing and reusing one instance.
            Document doc = new Document();
            doc.add(new TextField("text", data.text, Field.Store.YES));
            doc.add(new LongField("dateTime", data.dateTime, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.close();

        reader = DirectoryReader.open(directory);
        searcher = new IndexSearcher(reader);
    }


    @Test
    public void test1() throws Exception {
        final Map<Integer, Long> map = new HashMap<>();
        Collector collector = new SimpleCollector() {
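            // docBase of the segment being collected; collect() receives segment-relative doc ids.
            int base = 0;

            @Override
            public void collect(int doc) throws IOException {
                // Loads the stored document to read the value (simple, but slow on large indexes).
                // The test data stores the year directly; for a real timestamp, divide by the
                // precision here (e.g. / 10000 for the question's 'year' precision).
                Integer year = Integer.valueOf(reader.document(doc + base).get("dateTime"));
                Long count = map.get(year);
                map.put(year, count == null ? 1L : count + 1);
            }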

            @Override
            public boolean needsScores() {
                return false;
            }

            @Override
            protected void doSetNextReader(LeafReaderContext context) throws IOException {
                base = context.docBase;
            }
        };
        searcher.search(new MatchAllDocsQuery(), collector);
        for (Integer integer : map.keySet()) {
            System.out.print("year = " + integer);
            System.out.println(" count = " + map.get(integer));
        }
    }
}

The output I get is:

year = 2012 count = 3
year = 2013 count = 1

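Depending on how many records you have, this may be slow: it loads every matching document to read its year and then groups on that value. There is also the grouping module, which you may want to look at.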