Requirement
One of my text fields contains (among other things) domain names. Given (for example) the text "www.docs.corp.com", I want searches for "www", "docs", "corp", "com", "www.docs", "docs.corp", "corp.com", "www.docs.corp", "docs.corp.com", or "www.docs.corp.com" to find the relevant documents containing "www.docs.corp.com".
What I have so far:
Currently I use a charFilter to replace "." with a space before tokenizing with StandardTokenizerFactory:
<fieldType name="text_clr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
This sort of works, but a search for "corp.com" will actually search for "corp com", and will therefore find irrelevant matches such as "...corp. com.company.www will also...", along with many other false positives.
Hypothesis
I think what I need is a token filter: something that takes the token "www.docs.corp.com" and generates multiple tokens from it: ["www", "docs", "corp", "com", "www.docs", "docs.corp", "corp.com", "www.docs.corp", "docs.corp.com", "www.docs.corp.com"].
Question
Is this the right approach, or am I missing something more elegant, such as an existing filter I could configure to do this?
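The enumeration itself is simple; as a minimal standalone sketch (class and method names are mine, for illustration only):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubdomainEnumerator {
    // Generate every contiguous dotted sub-sequence of a domain name,
    // from single labels up to the full name.
    public static List<String> subSequences(String domain) {
        String[] parts = domain.split("\\.");
        List<String> result = new ArrayList<>();
        for (int len = 1; len <= parts.length; ++len)
            for (int start = 0; start + len <= parts.length; ++start)
                result.add(String.join(".", Arrays.copyOfRange(parts, start, start + len)));
        return result;
    }
}
```

For "www.docs.corp.com" (4 labels) this yields 4 + 3 + 2 + 1 = 10 tokens, exactly the list above.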
Answer 0 (score: 1)
Answering my own question, for the benefit of anyone who may be looking for something similar in the future.
It seems the solution I proposed is indeed the way to go. I have since implemented it, and I'm posting it here. It consists of two classes: a token filter and a token filter factory. Usage should be obvious to anyone familiar with Solr.
A quick write-up I did on this: http://blog.nitzanshaked.net/solr-domain-name-tokenizer/
Files:
DomainNameTokenFilterFactory.java
package com.clarityray.solr.analysis;

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class DomainNameTokenFilterFactory extends TokenFilterFactory {
    private int minLen;
    private int maxLen;
    private boolean withOriginal;

    public DomainNameTokenFilterFactory(Map<String,String> args) {
        super(args);
        withOriginal = getBoolean(args, "withOriginal", true);
        minLen = getInt(args, "minLen", 2);
        maxLen = getInt(args, "maxLen", -1);
        if (!args.isEmpty())
            throw new IllegalArgumentException("Unknown parameters: " + args);
    }

    @Override
    public TokenStream create(TokenStream ts) {
        return new DomainNameTokenFilter(ts, minLen, maxLen, withOriginal);
    }
}
DomainNameTokenFilter.java
package com.clarityray.solr.analysis;

import java.io.IOException;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DomainNameTokenFilter extends TokenFilter {
    private CharTermAttribute charTermAttr;
    private PositionIncrementAttribute posIncAttr;
    private Queue<String> output;
    private int minLen;
    private int maxLen;
    private boolean withOriginal;

    public DomainNameTokenFilter(TokenStream ts, int minLen, int maxLen, boolean withOriginal) {
        super(ts);
        this.charTermAttr = addAttribute(CharTermAttribute.class);
        this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
        this.output = new LinkedList<String>();
        this.minLen = minLen;
        this.maxLen = maxLen;
        this.withOriginal = withOriginal;
    }

    private String join(String glue, String[] arr, int start, int end) {
        if (end < start)
            return "";
        StringBuilder sb = new StringBuilder();
        sb.append(arr[start]);
        for (int i = start + 1; i <= end; ++i) {
            sb.append(glue);
            sb.append(arr[i]);
        }
        return sb.toString();
    }

    @Override
    public boolean incrementToken() throws IOException {
        // first -- emit any tokens still waiting in the output buffer
        if (!output.isEmpty()) {
            charTermAttr.setEmpty();
            charTermAttr.append(output.poll());
            posIncAttr.setPositionIncrement(0);
            return true;
        }
        // no tokens ready in the output buffer? get the next token from the input stream
        if (!input.incrementToken())
            return false;
        // get the text for the current token
        String s = charTermAttr.toString();
        // if the input does not look like a domain name, leave it as is
        if (s.indexOf('.') == -1)
            return true;
        // create all sub-sequences
        String[] subParts = s.split("[.]");
        int actualMaxLen = Math.min(
            this.maxLen > 0 ? this.maxLen : subParts.length,
            subParts.length
        );
        for (int currentLen = this.minLen; currentLen <= actualMaxLen; ++currentLen)
            for (int i = 0; i + currentLen - 1 < subParts.length; ++i)
                output.add(join(".", subParts, i, i + currentLen - 1));
        // preserve the original token if asked to (and it wasn't already generated above)
        if (withOriginal && actualMaxLen < subParts.length)
            output.add(s);
        // nothing was generated (e.g. minLen exceeds the number of labels and
        // withOriginal is false): pass the original token through unchanged
        if (output.isEmpty())
            return true;
        // emit the first of the generated tokens
        charTermAttr.setEmpty();
        charTermAttr.append(output.poll());
        posIncAttr.setPositionIncrement(1);
        return true;
    }
}
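Wiring the factory into a field type might look like the following sketch (the fieldType name, the tokenizer choice, and the parameter values here are illustrative and assume the compiled jar is on Solr's classpath):

```xml
<fieldType name="text_domain" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="com.clarityray.solr.analysis.DomainNameTokenFilterFactory" minLen="1" withOriginal="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```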
Hope this helps someone.
Answer 1 (score: 0)
I would add a WordDelimiterFilterFactory with the preserveOriginal option, together with a WhitespaceTokenizerFactory:
    preserveOriginal="1" causes the original token to be indexed without modifications (in addition to the tokens produced by the other options)
    Default is 0
The WhitespaceTokenizerFactory will preserve the periods. Then, when you use the WordDelimiterFilterFactory with the preserveOriginal option, it should index both the component parts and the original. I would also consider adding a LowerCaseFilterFactory, otherwise you could end up with mixed case in your index, which is probably not what you want.
Something like this, though you'll need to play with it a bit:
<fieldType name="text_clr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
This may not get you all the way there, but it should give you a good start. I would check this page for more details on the WordDelimiterFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
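As a rough illustration of what those WordDelimiterFilterFactory flags should produce for a domain name, here is a plain-Java approximation I wrote for this answer (not the actual Lucene filter):

```java
import java.util.ArrayList;
import java.util.List;

public class WdfApproximation {
    // Rough approximation of WordDelimiterFilter with generateWordParts=1,
    // catenateWords=1, preserveOriginal=1: keep the original token, split
    // on non-alphanumeric delimiters, and emit the run-together concatenation.
    public static List<String> tokens(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);                                   // preserveOriginal=1
        String[] parts = token.split("[^A-Za-z0-9]+");
        for (String p : parts) out.add(p);                // generateWordParts=1
        out.add(String.join("", parts));                  // catenateWords=1
        return out;
    }
}
```

Note that this yields the individual labels ("www", "docs", ...) and the run-together form ("wwwdocscorpcom"), but not the dotted sub-sequences such as "corp.com" that the question asks for.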