I have a Lucene index containing the following documents:
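_id | Name | Alternate Names | Population
123 | Bosc de Planavilla | (some names here in other languages) | 5000
345 | Planavilla | (some names here in other languages) | 20000
456 | Bosc de la Planassa | (some names here in other languages) | 1000
567 | Bosc de Plana en Blanca | (some names here in other languages) | 100000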
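Which Lucene query type is best suited to the following, and how should I build it?
If the user searches for "Italian restaurants near Bosc de Planavilla", I want document 123 returned, because the query contains an exact match of that document's name.
If the user searches for "Italian restaurants near Planavilla", I want document 345, because the query contains an exact match and it has the highest population.
If the user searches for "Italian restaurants near Bosc", I want 567, because the query contains "Bosc" and, of the 3 "Bosc" documents, it has the highest population.
There are probably many other use cases, but you get a sense of what I need.
What kind of query would do this? Should I generate word N-grams (shingles), build an ORed BooleanQuery from the shingles and then apply custom scoring? Or would a regular phrase query do? I have also seen DisjunctionMaxQuery, but I don't know whether it is what I'm looking for.
The idea, as you have probably understood by now, is to find the exact place the user implied in the query. From there I can start my geo search and add some further queries.
What is the best approach?
Thanks in advance.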
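Answer 0 (score: 1)
How are you tokenizing the fields? Are you storing them as complete strings? Also, how are you parsing the query?
OK, so I have been playing around with this. I use a StopFilter to remove la, en and de. Then I use a ShingleFilter to get multiple word combinations, so that "exact matches" become possible. For example, Bosc de Planavilla yields tokens such as [Bosc] [Bosc Planavilla], and Bosc de Plana en Blanca yields tokens such as [Bosc] [Bosc Plana] [Plana Blanca] [Bosc Plana Blanca]. That way you can get an "exact match" on parts of the query.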
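To make the tokenization concrete, here is a minimal plain-Java sketch of what stop-word removal followed by shingling produces. It does not use Lucene: `ShingleSketch` and its `shingles` method are hypothetical names made up for illustration, and the real ShingleFilter also tracks token positions, which this sketch ignores. It lists every shingle, including the remaining single words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: stop-word removal plus word shingles, without Lucene. */
final class ShingleSketch {
    // Same stop words as in the answer; matching is case-sensitive.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("de", "la", "en"));

    /** Returns every run of 1..maxSize consecutive non-stop words, joined by one space. */
    static List<String> shingles(String text, int maxSize) {
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            for (int size = 1; size <= maxSize && start + size <= words.size(); size++) {
                out.add(String.join(" ", words.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("Bosc de Planavilla", 3));
        System.out.println(shingles("Bosc de Plana en Blanca", 3));
    }
}
```

With a maximum shingle size of 3, "Bosc de Planavilla" gives Bosc, Bosc Planavilla and Planavilla, so both the full name and its individual words can be matched.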
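Then I query with the exact string the user passed in, although some tuning could be done there as well. I picked the simple cases so that the results come out closer to what you asked for.
Here is the code I'm using (Lucene 3.0.3):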
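    import java.io.IOException;
    import java.io.Reader;

    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.ImmutableSet;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class ShingleFilterTests {
        private Analyzer analyzer;
        private IndexSearcher searcher;
        private IndexReader reader;

        // Whitespace tokens, minus the stop words de/la/en, optionally combined
        // into word shingles of up to `shingles` words.
        public static Analyzer createAnalyzer(final int shingles) {
            return new Analyzer() {
                @Override
                public TokenStream tokenStream(String fieldName, Reader reader) {
                    TokenStream tokenizer = new WhitespaceTokenizer(reader);
                    tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                    if (shingles > 0) {
                        tokenizer = new ShingleFilter(tokenizer, shingles);
                    }
                    return tokenizer;
                }
            };
        }

        @Before
        public void setUp() throws Exception {
            Directory dir = new RAMDirectory();
            // Index with shingles of up to 3 words.
            analyzer = createAnalyzer(3);
            IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa",
                    "Bosc de Plana en Blanca");
            ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);
            for (int id = 0; id < cities.size(); id++) {
                Document doc = new Document();
                doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("population", String.valueOf(populations.get(id)),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();
            searcher = new IndexSearcher(dir);
            reader = searcher.getIndexReader();
        }

        @After
        public void tearDown() throws Exception {
            searcher.close();
        }

        @Test
        public void testShingleFilter() throws Exception {
            System.out.println("shingle filter");
            // Queries are analyzed without shingles (stop words are still removed).
            QueryParser qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
            printSearch(qp, "city:\"Bosc de Planavilla\"");
            printSearch(qp, "city:Planavilla");
            printSearch(qp, "city:Bosc");
        }

        // Runs the query and prints the top 4 hits with their city and population.
        private void printSearch(QueryParser qp, String query) throws ParseException, IOException {
            Query q = qp.parse(query);
            System.out.println("query " + q);
            TopDocs hits = searcher.search(q, 4);
            System.out.println("results " + hits.totalHits);
            int i = 1;
            for (ScoreDoc dc : hits.scoreDocs) {
                Document doc = reader.document(dc.doc);
                System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
            }
            System.out.println();
        }
    }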
I'm now looking into sorting by population.
This prints:
    query city:"Bosc Planavilla"
    results 1
    1. doc=0 score=1.143841 "Bosc de Planavilla" population: 5000

    query city:Planavilla
    results 2
    1. doc=1 score=1.287682 "Planavilla" population: 20000
    2. doc=0 score=0.643841 "Bosc de Planavilla" population: 5000

    query city:Bosc
    results 3
    1. doc=0 score=0.5 "Bosc de Planavilla" population: 5000
    2. doc=2 score=0.5 "Bosc de la Planassa" population: 1000
    3. doc=3 score=0.375 "Bosc de Plana en Blanca" population: 100000
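The question's three example queries can also be walked through without Lucene. The sketch below is a rough plain-Java illustration, not the approach used in this answer: it shingles the full user query, looks for the longest shingle shared with each place name, and breaks ties by population. `PlaceResolverSketch`, `Place`, `resolve` and `samplePlaces` are hypothetical names, and the "longest shared shingle, then population" rule is an assumption about what the asker wants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: pick a place by longest shared shingle, then by population. */
final class PlaceResolverSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("de", "la", "en"));

    static final class Place {
        final int id;
        final String name;
        final int population;

        Place(int id, String name, int population) {
            this.id = id;
            this.name = name;
            this.population = population;
        }
    }

    /** The four documents from the question. */
    static List<Place> samplePlaces() {
        return Arrays.asList(
                new Place(123, "Bosc de Planavilla", 5000),
                new Place(345, "Planavilla", 20000),
                new Place(456, "Bosc de la Planassa", 1000),
                new Place(567, "Bosc de Plana en Blanca", 100000));
    }

    /** Every run of 1..maxSize consecutive non-stop words, joined by one space. */
    static List<String> shingles(String text, int maxSize) {
        List<String> words = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) {
                words.add(w);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.size(); start++) {
            for (int size = 1; size <= maxSize && start + size <= words.size(); size++) {
                out.add(String.join(" ", words.subList(start, start + size)));
            }
        }
        return out;
    }

    /** Returns the best place for the query, or null if no shingle is shared. */
    static Place resolve(String query, List<Place> places, int maxSize) {
        Set<String> queryShingles = new HashSet<>(shingles(query, maxSize));
        Place best = null;
        int bestLen = 0;
        for (Place p : places) {
            int len = 0; // word count of the longest shingle shared with the query
            for (String s : shingles(p.name, maxSize)) {
                if (queryShingles.contains(s)) {
                    len = Math.max(len, s.split(" ").length);
                }
            }
            if (len == 0) {
                continue;
            }
            boolean longer = len > bestLen;
            boolean tieButBigger = len == bestLen && best != null && p.population > best.population;
            if (best == null || longer || tieButBigger) {
                best = p;
                bestLen = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Place p = resolve("Italian restaurants near Bosc", samplePlaces(), 3);
        System.out.println(p == null ? "no match" : p.id + " " + p.name);
    }
}
```

In Lucene terms, the same preference would have to come from scoring (longer shingle matches scoring higher) combined with a population-based sort or boost.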
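Answer 1 (score: 1)
Here is the code with sorting. That said, I think it would make more sense to add custom scoring that takes the size of the city into account rather than brute-force sorting by population. Also note that this uses FieldCache, which may not be the best solution in terms of memory usage.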
    import java.io.IOException;
    import java.io.Reader;

    import com.google.common.collect.ImmutableList;
    import com.google.common.collect.ImmutableSet;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.FieldComparator;
    import org.apache.lucene.search.FieldComparatorSource;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;
    import org.junit.After;
    import org.junit.Before;
    import org.junit.Test;

    public class ShingleFilterTests {
        private Analyzer analyzer;
        private IndexSearcher searcher;
        private IndexReader reader;
        private QueryParser qp;
        private Sort sort;

        // Whitespace tokens, minus the stop words de/la/en, optionally combined
        // into word shingles of up to `shingles` words.
        public static Analyzer createAnalyzer(final int shingles) {
            return new Analyzer() {
                @Override
                public TokenStream tokenStream(String fieldName, Reader reader) {
                    TokenStream tokenizer = new WhitespaceTokenizer(reader);
                    tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en"));
                    if (shingles > 0) {
                        tokenizer = new ShingleFilter(tokenizer, shingles);
                    }
                    return tokenizer;
                }
            };
        }

        // Sorts hits by the "population" field, largest first.
        public class PopulationComparatorSource extends FieldComparatorSource {
            @Override
            public FieldComparator newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException {
                return new PopulationComparator(fieldname, numHits);
            }

            private class PopulationComparator extends FieldComparator {
                private final String fieldName;
                private Integer[] values;   // population per priority-queue slot
                private int[] populations;  // population per document of the current reader
                private int bottom;

                public PopulationComparator(String fieldname, int numHits) {
                    values = new Integer[numHits];
                    this.fieldName = fieldname;
                }

                @Override
                public int compare(int slot1, int slot2) {
                    // Larger population sorts first (descending order).
                    if (values[slot1] > values[slot2]) return -1;
                    if (values[slot1] < values[slot2]) return 1;
                    return 0;
                }

                @Override
                public void setBottom(int slot) {
                    bottom = values[slot];
                }

                @Override
                public int compareBottom(int doc) throws IOException {
                    int value = populations[doc];
                    if (bottom > value) return -1;
                    if (bottom < value) return 1;
                    return 0;
                }

                @Override
                public void copy(int slot, int doc) throws IOException {
                    values[slot] = populations[doc];
                }

                @Override
                public void setNextReader(IndexReader reader, int docBase) throws IOException {
                    /* XXX uses field cache */
                    populations = FieldCache.DEFAULT.getInts(reader, "population");
                }

                @Override
                public Comparable value(int slot) {
                    return values[slot];
                }
            }
        }

        @Before
        public void setUp() throws Exception {
            Directory dir = new RAMDirectory();
            // Index with shingles of up to 3 words.
            analyzer = createAnalyzer(3);
            IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
            ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa",
                    "Bosc de Plana en Blanca");
            ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000);
            for (int id = 0; id < cities.size(); id++) {
                Document doc = new Document();
                doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("population", String.valueOf(populations.get(id)),
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.addDocument(doc);
            }
            writer.close();
            // Queries are analyzed without shingles (stop words are still removed).
            qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0));
            sort = new Sort(new SortField("population", new PopulationComparatorSource()));
            searcher = new IndexSearcher(dir);
            // Keep computing scores even though hits are sorted by a field.
            searcher.setDefaultFieldSortScoring(true, true);
            reader = searcher.getIndexReader();
        }

        @After
        public void tearDown() throws Exception {
            searcher.close();
        }

        @Test
        public void testShingleFilter() throws Exception {
            System.out.println("shingle filter");
            printSearch("city:\"Bosc de Planavilla\"");
            printSearch("city:Planavilla");
            printSearch("city:Bosc");
        }

        // Runs the query sorted by population and prints the top 4 hits.
        private void printSearch(String query) throws ParseException, IOException {
            Query q = qp.parse(query);
            System.out.println("query " + q);
            TopDocs hits = searcher.search(q, null, 4, sort);
            System.out.println("results " + hits.totalHits);
            int i = 1;
            for (ScoreDoc dc : hits.scoreDocs) {
                Document doc = reader.document(dc.doc);
                System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population"));
            }
            System.out.println();
        }
    }
This gives the following results:
    query city:"Bosc Planavilla"
    results 1
    1. doc=0 score=1.143841[5000] "Bosc de Planavilla" population: 5000

    query city:Planavilla
    results 2
    1. doc=1 score=1.287682[20000] "Planavilla" population: 20000
    2. doc=0 score=0.643841[5000] "Bosc de Planavilla" population: 5000

    query city:Bosc
    results 3
    1. doc=3 score=0.375[100000] "Bosc de Plana en Blanca" population: 100000
    2. doc=0 score=0.5[5000] "Bosc de Planavilla" population: 5000
    3. doc=2 score=0.5[1000] "Bosc de la Planassa" population: 1000
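As for the custom-scoring idea mentioned at the top of this answer, one possible shape is to blend the relevance score with the population instead of sorting on population alone. The plain-Java sketch below only illustrates the arithmetic; it is not a Lucene API. `PopulationBoostSketch`, `Hit`, `blended` and `rerank` are hypothetical names, and the formula score * log10(1 + population) is an assumption chosen for illustration, not something taken from Lucene or tuned on real data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: re-rank hits by relevance blended with city size. */
final class PopulationBoostSketch {
    static final class Hit {
        final int doc;
        final double score;
        final int population;

        Hit(int doc, double score, int population) {
            this.doc = doc;
            this.score = score;
            this.population = population;
        }

        /** Assumed blend: relevance damped by the logarithm of the population. */
        double blended() {
            return score * Math.log10(1.0 + population);
        }
    }

    /** Returns a new list ordered by blended score, highest first. */
    static List<Hit> rerank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Double.compare(b.blended(), a.blended()));
        return sorted;
    }

    public static void main(String[] args) {
        // Scores and populations copied from the city:Bosc output of the first answer.
        List<Hit> hits = Arrays.asList(
                new Hit(0, 0.5, 5000),
                new Hit(2, 0.5, 1000),
                new Hit(3, 0.375, 100000));
        for (Hit h : rerank(hits)) {
            System.out.println("doc=" + h.doc + " blended=" + h.blended());
        }
    }
}
```

With this particular formula the city:Bosc hits come out in the order doc 3, doc 0, doc 2, the same as the population sort, but a much better text match can still outrank a larger city. How strongly population should weigh in is a tuning decision.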