Question

I need help optimizing the following pseudocode. Any help would be appreciated.

for every term
  open new index searcher
  do search
  if found
    skip and search for next term
  else
    add it to index
    commit
  close searcher

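In the code above, whenever I add a new doc/term to the index I have to commit the change, even though only a single document was added (which I find expensive), so that the change is visible the next time a new index searcher is opened.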

Is there any way to improve the performance? FYI: I have 36 million terms to index.

Answer 1

You can create a HashSet to deduplicate the list of terms in memory, and then index only those unique terms. The pseudocode would look like this:

set := new HashSet
for each term
  if set contains term
    skip to next iteration
  else
    add term to set
end
open index
for each term in set
  add term to index
end
close index
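
A minimal Java sketch of the deduplication step above, assuming the terms are plain strings and fit in memory. The names `TermDeduplicator` and `uniqueTerms` are hypothetical, and the actual index write is left as a comment because it depends on your Lucene/Solr setup.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TermDeduplicator {

    // Returns the distinct terms in first-seen order.
    // LinkedHashSet is used instead of HashSet only to keep the order stable.
    static List<String> uniqueTerms(Iterable<String> terms) {
        Set<String> seen = new LinkedHashSet<>();
        for (String term : terms) {
            seen.add(term); // add() leaves the set unchanged if the term is already present
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> input = List.of("apple", "pear", "apple", "fig", "pear");
        for (String term : uniqueTerms(input)) {
            // Hypothetical: this is where each term would be added to the index,
            // with a single commit after the loop instead of one per term.
            System.out.println(term);
        }
    }
}
```

Since every term is written exactly once, the index only needs to be opened, written, and committed a single time.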

Answer 2

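I would suggest simply creating a second index (in a RAMDirectory, or in an FSDirectory at a temporary location). Add all the terms/documents that were not found to this second (temporary) index, and merge the two indexes at the end.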

open index searcher on the primary index
for every term
  do search
  if found 
    skip and search for next term
  else
    add it to the second index
end
close searcher
commit temp index
merge temp index into primary index 
commit primary index
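
The control flow above can be sketched in plain Java. This is a library-free model, not Lucene code: two `Set<String>` values stand in for the primary and temporary indexes, and `StagedIndexer` and `stageAndMerge` are hypothetical names. With Lucene itself, the final step would roughly correspond to merging the temporary directory into the primary writer (for example with `IndexWriter.addIndexes`) followed by one commit.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StagedIndexer {

    // Stand-ins (assumption): `primary` models the existing index and the
    // local `temp` set models the temporary second index.
    // Returns the number of terms that were new to `primary`.
    static int stageAndMerge(Set<String> primary, Iterable<String> terms) {
        Set<String> temp = new HashSet<>();
        for (String term : terms) {
            // "do search": primary is only read inside the loop, so it never
            // has to be committed or reopened between terms.
            if (primary.contains(term)) {
                continue; // found: skip and go to the next term
            }
            temp.add(term); // not found: add it to the temporary index
        }
        int added = temp.size();
        primary.addAll(temp); // merge temp into primary, then commit once
        return added;
    }

    public static void main(String[] args) {
        Set<String> primary = new HashSet<>(List.of("apple", "pear"));
        int added = stageAndMerge(primary, List.of("pear", "fig", "kiwi", "fig"));
        System.out.println(added + " new terms, index size " + primary.size());
    }
}
```

In this model the temporary set also rejects repeats within the incoming terms. A real temporary index would likely need its own check for that, since searching only the primary index does not see terms staged earlier in the same run.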

Searching Solr while indexing