Question

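I am using a RegexQuery to query a Lucene index. For example, I run new RegexQuery(new Term("text", "https?://[^\\s]+")) to get all documents that contain a URL (I know, the regex is oversimplified).

Now I want to retrieve the text fragments that actually matched my query, e.g. http://example.com. Does Lucene provide an efficient way to do this? Or do I have to process the whole text again with Java's regex Matcher?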

Answer 1

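I don't think what you want is possible, but here is a different approach with a similar effect:

Open an IndexReader and get all terms from "http" onward (terms are ordered lexicographically), until they no longer start with "http://" or "https://":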

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    final IndexReader reader = IndexReader.open(IndexHelper.DIRECTORY, true);
    // terms(t) positions the enum at the first term >= t
    final TermEnum termEnum = reader.terms(new Term("text", "http"));
    final List<Term> terms = new ArrayList<Term>();
    try {
        do {
            final Term foundTerm = termEnum.term();
            // stop at the end of the enum, or once we leave the field or the "http" prefix
            if (foundTerm == null || !"text".equals(foundTerm.field())
                    || !foundTerm.text().startsWith("http")) {
                break;
            }
            // terms such as "httpd" sort between "http://..." and "https://...", so skip them
            // instead of stopping; collect only terms that look like URLs
            if (foundTerm.text().startsWith("https://") || foundTerm.text().startsWith("http://")) {
                terms.add(foundTerm);
            }
        } while (termEnum.next());
    } finally {
        termEnum.close();
    }

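The resulting URLs are then in the terms list, as Lucene terms.

This does of course have the drawback that you don't get the documents in which these URLs were found, but you can query for them again using the found terms.

The way I laid it out here is not very flexible (though it probably gets the job done faster), but you can of course go back to a pattern for more flexibility. You would then replace every foundTerm.text().startsWith("https://") || foundTerm.text().startsWith("http://") check with yourPattern.matcher(foundTerm.text()).matches().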
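As a rough sketch of that pattern-based variant, the following uses only java.util.regex and plain strings standing in for the values of foundTerm.text(). The class name TermPatternFilter, the helper filterTerms, and the sample term texts are made up for illustration and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TermPatternFilter {

    // Hypothetical helper: keeps only the term texts that match the pattern in full.
    public static List<String> filterTerms(List<String> termTexts, Pattern pattern) {
        final List<String> matches = new ArrayList<String>();
        for (String text : termTexts) {
            // matcher(...).matches() requires the whole term to match, not just a substring
            if (pattern.matcher(text).matches()) {
                matches.add(text);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for term texts read from the index
        final List<String> termTexts = Arrays.asList(
                "http", "http://example.com", "httpd", "https", "https://example.org/a");
        final Pattern urlPattern = Pattern.compile("https?://[^\\s]+");
        System.out.println(filterTerms(termTexts, urlPattern));
    }
}
```

With this sample data the program prints [http://example.com, https://example.org/a]. Note that this only works if the analyzer used at index time keeps each URL as a single term.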

Sorry for writing so much ^^.

I hope it helps.
