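I'm trying to find an example that demonstrates a Lucene index (or another type of index) that can check English first- and last-name combinations for possible duplicates. The duplicate check needs to account for common nicknames, e.g. Bob for Robert and Bill for William, as well as misspellings. Does anyone know of an example?
I plan to run the duplicate search during user registration. Each new user record needs to be checked against an index built from the database table that stores the user names.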
Answer 0 (score: 2)
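I would use a SynonymFilter on firstName at index time, so that every possible variant ends up in the index (Bob -> Robert, Robert -> Bob, etc.). Index your existing users that way.
Then use a QueryParser (with no SynonymFilter in its analyzer) to run fuzzy queries against that index.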
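To illustrate why the query side needs no synonym filter, here is a minimal plain-Java sketch of the same idea without Lucene. The class `NicknameIndexSketch`, its methods, and the nickname groups are hypothetical names made up for this illustration; it only mimics index-time expansion with a map and is not how Lucene stores terms internally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for index-time synonym expansion (not Lucene code).
class NicknameIndexSketch {
    // Illustrative nickname groups; a real list would be much longer.
    static final List<List<String>> GROUPS = Arrays.asList(
            Arrays.asList("Bob", "Bobby", "Robert"),
            Arrays.asList("William", "Will", "Bill", "Billy"));

    // token -> ids of the records whose first name expands to that token
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Expand a first name to its whole nickname group, or to itself if unknown.
    static Set<String> expand(String firstName) {
        for (List<String> group : GROUPS) {
            if (group.contains(firstName)) {
                return new LinkedHashSet<>(group);
            }
        }
        return Collections.singleton(firstName);
    }

    // "Index time": store the record id under every variant of the name.
    void add(int id, String firstName) {
        for (String token : expand(firstName)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
    }

    // "Query time": a plain exact lookup, no expansion needed.
    Set<Integer> lookup(String firstName) {
        return postings.getOrDefault(firstName, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        NicknameIndexSketch index = new NicknameIndexSketch();
        index.add(0, "William");
        index.add(1, "Robert");
        index.add(2, "Anton");
        System.out.println(index.lookup("Bob"));
        System.out.println(index.lookup("Bill"));
        System.out.println(index.lookup("Tony"));
    }
}
```

Looking up Bob returns the record that was stored as Robert, because the expansion already happened when the record was added; a name that was never indexed returns no ids.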
Here is the Lucene test I came up with:
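import java.io.IOException;
import java.io.Reader;
import java.util.List;

import com.google.common.collect.HashMultimap;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.Multimap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

// NOTE: SynonymFilter and SynonymEngine are not imported above. The
// getSynonyms(String) signature matches the helper classes from the
// "Lucene in Action" sample code (an assumption); add the import for
// whichever implementation you use.
public class NameDuplicateTests {
    private Analyzer analyzer;
    private IndexSearcher searcher;
    private IndexReader reader;
    private QueryParser qp;
    private final static Multimap<String, String> firstNameSynonyms;

    static {
        // Map every name to its whole nickname group so the lookup is symmetric.
        firstNameSynonyms = HashMultimap.create();
        List<String> robertSynonyms = ImmutableList.of("Bob", "Bobby", "Robert");
        for (String name : robertSynonyms) {
            firstNameSynonyms.putAll(name, robertSynonyms);
        }
        List<String> willSynonyms = ImmutableList.of("William", "Will", "Bill", "Billy");
        for (String name : willSynonyms) {
            firstNameSynonyms.putAll(name, willSynonyms);
        }
    }

    public static Analyzer createAnalyzer() {
        return new Analyzer() {
            @Override
            public TokenStream tokenStream(String fieldName, Reader reader) {
                TokenStream tokenizer = new WhitespaceTokenizer(reader);
                // Only firstName gets synonym expansion.
                if (fieldName.equals("firstName")) {
                    tokenizer = new SynonymFilter(tokenizer, new SynonymEngine() {
                        @Override
                        public String[] getSynonyms(String s) throws IOException {
                            return firstNameSynonyms.get(s).toArray(new String[0]);
                        }
                    });
                }
                return tokenizer;
            }
        };
    }

    @Before
    public void setUp() throws Exception {
        Directory dir = new RAMDirectory();
        analyzer = createAnalyzer();
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        // Sample users: firstNames.get(i) pairs with lastNames.get(i).
        ImmutableList<String> firstNames = ImmutableList.of("William", "Robert", "Bobby", "Will", "Anton");
        ImmutableList<String> lastNames = ImmutableList.of("Robert", "Williams", "Mayor", "Bob", "FunkyMother");
        for (int id = 0; id < firstNames.size(); id++) {
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("firstName", firstNames.get(id), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("lastName", lastNames.get(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        // The query side uses a plain WhitespaceAnalyzer: the index already
        // holds every first-name variant, so no expansion is needed here.
        qp = new QueryParser(Version.LUCENE_30, "firstName", new WhitespaceAnalyzer());
        searcher = new IndexSearcher(dir);
        reader = searcher.getIndexReader();
    }

    @After
    public void tearDown() throws Exception {
        searcher.close();
    }

    @Test
    public void testNameFilter() throws Exception {
        // Nickname plus exact last name.
        search("+firstName:Bob +lastName:Williams");
        // Nickname plus misspelled last name, matched via the fuzzy operator (~).
        search("+firstName:Bob +lastName:Wolliam~");
    }

    // Parse the query, print it, and print the top three hits.
    private void search(String query) throws ParseException, IOException {
        Query q = qp.parse(query);
        System.out.println(q);
        TopDocs res = searcher.search(q, 3);
        for (ScoreDoc sd : res.scoreDocs) {
            Document doc = reader.document(sd.doc);
            System.out.println("Found " + doc.get("firstName") + " " + doc.get("lastName"));
        }
    }
}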
The output is:
+firstName:Bob +lastName:Williams
Found Robert Williams
+firstName:Bob +lastName:wolliam~0.5
Found Robert Williams
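The second hit is worth a closer look: the parser lowercased the fuzzy term to wolliam and applied the default minimum similarity of 0.5, yet it still matched the stored term Williams. The sketch below reproduces that arithmetic in plain Java. It assumes, to my understanding of Lucene 3.x, that similarity is computed as 1 - editDistance / min(length of query term, length of indexed term) and that a term matches only when the similarity is strictly greater than the threshold; treat that formula as an assumption rather than a specification. The class `FuzzySimilaritySketch` and its method names are hypothetical.

```java
// Hypothetical re-implementation of a fuzzy-match check (not Lucene code).
class FuzzySimilaritySketch {

    // Classic Levenshtein edit distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Assumed scoring: 1 - distance / length of the shorter string.
    static double similarity(String query, String term) {
        int shorter = Math.min(query.length(), term.length());
        if (shorter == 0) {
            return query.length() == term.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) levenshtein(query, term) / shorter;
    }

    // A term matches when its similarity is strictly above the threshold.
    static boolean matches(String query, String term, double minSimilarity) {
        return similarity(query, term) > minSimilarity;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("wolliam", "Williams"));
        System.out.println(similarity("wolliam", "Williams"));
        System.out.println(matches("wolliam", "Williams", 0.5));
        System.out.println(matches("wolliam", "Mayor", 0.5));
    }
}
```

Under that assumption, wolliam and Williams are three edits apart (the case of the first letter, o versus i, and the trailing s), which gives a similarity of 1 - 3/7, just above 0.5. Because lastName is indexed NOT_ANALYZED, its case is preserved, so the capital W costs one of those edits; lowercasing last names at index time would leave more headroom for real typos.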
Hope that helps!