Question

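I want to do general substring search over billions of strings. The requirement differs slightly from ordinary full-text search, because I want a query for "ubst" to also hit "substr".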

Can Lucene or Sphinx do this? If not, what do you think is the best approach?

Answer 1

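The best index structure for this case is a suffix tree. Lucene does not implement this type of index, so its substring search is slow. Lucene does have a prefix tree index, though, which means that searching for terms by prefix is fast.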
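To make the idea concrete, here is a minimal sketch of a sorted suffix list (a simpler relative of the suffix tree, not a suffix tree itself). The function names build_suffix_index and find_substring are my own, made up for illustration. It stores every suffix in full, which is fine for a demo but far too memory-hungry for billions of strings; a real implementation would store offsets instead.

```python
def build_suffix_index(strings):
    """Return a sorted list of (suffix, string_id) pairs covering every suffix."""
    entries = []
    for string_id, text in enumerate(strings):
        for start in range(len(text)):
            entries.append((text[start:], string_id))
    entries.sort()
    return entries


def find_substring(entries, strings, query):
    """Return the strings containing `query`, using binary search on suffixes."""
    # Binary search for the first suffix that is >= query.
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid][0] < query:
            lo = mid + 1
        else:
            hi = mid
    # Every suffix that starts with `query` sits in one contiguous run from here.
    matches = set()
    while lo < len(entries) and entries[lo][0].startswith(query):
        matches.add(entries[lo][1])
        lo += 1
    return sorted(strings[i] for i in matches)


if __name__ == "__main__":
    data = ["substr", "lucene", "sphinx"]
    index = build_suffix_index(data)
    print(find_substring(index, data, "ubst"))
```

A substring query becomes a prefix query over suffixes, which is why a prefix-only index cannot answer it directly unless the suffixes themselves are indexed.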

Answer 2

Lucene is one of the best options. Lucene supports substring search, so ubst will return substr.

See http://wiki.apache.org/lucene-java/LuceneImplementations for an implementation in a suitable language.

Answer 3

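Sphinx has supported efficient substring search since version 2.0.1-beta, released April 22, 2011. Unfortunately, as of today, this support is available only in beta versions, as noted here.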

I tried the 2.1.1 beta, and it seems to work fine. See the manual entry for the dict setting, specifically the keywords type.

When I tried the 2.0.6 release, it fell back to the inefficient crc index and issued the following warning during indexing:

WARNING: min_infix_len is not supported yet with dict=keywords; using dict=crc

My minimal configuration file:

source sour
{
  type = xmlpipe2
  xmlpipe_command = type C:\Temp\1\sphinx\input.xml
}

index inde
{
  source = sour
  path = testpa
  enable_star = 1
  dict = keywords
  charset_type = utf-8
  min_infix_len = 1
}
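
As a rough illustration of what min_infix_len controls, here is a conceptual sketch in Python. It is not how Sphinx is implemented internally; the function names infixes, build_infix_map and expand_wildcard are hypothetical and exist only for this example. The idea is that each keyword is reachable through any of its infixes of at least min_infix_len characters, so a wildcard query such as *ubst* can be expanded to the matching keywords.

```python
def infixes(keyword, min_infix_len):
    """All distinct substrings of `keyword` with at least `min_infix_len` characters."""
    return {
        keyword[i:j]
        for i in range(len(keyword))
        for j in range(i + min_infix_len, len(keyword) + 1)
    }


def build_infix_map(keywords, min_infix_len):
    """Map every qualifying infix to the set of keywords that contain it."""
    infix_map = {}
    for keyword in keywords:
        for part in infixes(keyword, min_infix_len):
            infix_map.setdefault(part, set()).add(keyword)
    return infix_map


def expand_wildcard(infix_map, pattern):
    """Expand a '*infix*' style pattern into the keywords it matches."""
    return sorted(infix_map.get(pattern.strip("*"), set()))


if __name__ == "__main__":
    infix_map = build_infix_map(["substr", "sphinx"], min_infix_len=1)
    print(expand_wildcard(infix_map, "*ubst*"))
```

In this sketch, raising min_infix_len shrinks the map but makes shorter patterns unmatchable: with min_infix_len=5, the four-character pattern *ubst* finds nothing.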

Building an index for substring search?
