Solr: tokenizing/searching parts of a domain name

Date: 2014-05-13 10:54:47

Tags: solr

Requirement

One of my text fields contains (among other things) domain names. Given (for example) the text "www.docs.corp.com", I want to be able to search for "www", "docs", "corp", "com", "www.docs", "docs.corp", "corp.com", "www.docs.corp", "docs.corp.com", or "www.docs.corp.com" and find the relevant documents that contain "www.docs.corp.com".

What I currently have:

Currently I use a charFilter to replace "." with a space before tokenizing with StandardTokenizerFactory:

<fieldType name="text_clr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory" />
  </analyzer>
</fieldType>

This sort of works, but searching for "corp.com" actually searches for "corp com", and therefore finds irrelevant matches such as "... corp. com.company.www will also ...", along with many other false positives.
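To illustrate the problem, here is a standalone sketch (hypothetical class name, not part of the schema) of what the PatternReplaceCharFilter feeds the tokenizer. Both the relevant document and an unrelated one end up containing the adjacent token pair [corp, com], so the phrase "corp com" matches both:

```java
// Sketch of the char-filter step: pattern="([.])", replacement=" ".
public class CharFilterSketch {

    // Mimics what the char filter does to the text before tokenization.
    public static String preTokenize(String text) {
        return text.replaceAll("[.]", " ");
    }

    public static void main(String[] args) {
        System.out.println(preTokenize("www.docs.corp.com"));
        // -> "www docs corp com"
        System.out.println(preTokenize("corp. com.company.www"));
        // -> "corp  com company www" -- "corp" and "com" are still adjacent,
        // so a phrase query for "corp.com" (rewritten to "corp com") matches.
    }
}
```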

Hypothesis

What I think I need is a token filter: something that takes the token "www.docs.corp.com" and generates multiple tokens from it: ["www", "docs", "corp", "com", "www.docs", "docs.corp", "corp.com", "www.docs.corp", "docs.corp.com", "www.docs.corp.com"].
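That generation logic can be sketched in plain Java (hypothetical class name, no Lucene dependency): emit every run of minLen..maxLen consecutive dot-separated labels. With minLen=1 it produces exactly the list above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch of the sub-sequence generation described above.
public class DomainSubSequences {

    // maxLen <= 0 means "no upper limit".
    public static List<String> generate(String domain, int minLen, int maxLen) {
        String[] parts = domain.split("[.]");
        int cap = maxLen > 0 ? Math.min(maxLen, parts.length) : parts.length;
        List<String> out = new ArrayList<>();
        for (int len = minLen; len <= cap; ++len)
            for (int i = 0; i + len <= parts.length; ++i)
                out.add(String.join(".", Arrays.copyOfRange(parts, i, i + len)));
        return out;
    }

    public static void main(String[] args) {
        // A 4-label domain yields 4 + 3 + 2 + 1 = 10 tokens.
        System.out.println(generate("www.docs.corp.com", 1, -1));
    }
}
```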

Question

Is this the right approach, or am I missing something more elegant, such as an existing filter that I can configure to do this?

2 answers:

Answer 0 (score: 1)

Answering my own question, for the benefit of anyone who might be looking for something similar in the future.

It seems the solution I proposed is indeed the way to go. I have implemented it and am posting it here. It consists of two classes: a token filter and a token filter factory. Usage should be obvious to anyone versed in Solr.

A link to a quick write-up I did on this: http://blog.nitzanshaked.net/solr-domain-name-tokenizer/

The files:

DomainNameTokenFilterFactory.java

package com.clarityray.solr.analysis;

import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;
import com.clarityray.solr.analysis.DomainNameTokenFilter;

public class DomainNameTokenFilterFactory extends TokenFilterFactory {

    private int minLen;
    private int maxLen;
    private boolean withOriginal;

    public DomainNameTokenFilterFactory(Map<String,String> args) {
        super(args);
        withOriginal = getBoolean(args, "withOriginal", true);
        minLen = getInt(args, "minLen", 2);
        maxLen = getInt(args, "maxLen", -1);
        if (!args.isEmpty())
            throw new IllegalArgumentException("Unknown parameters: " + args);
    }

    @Override
    public TokenStream create(TokenStream ts) {
        return new DomainNameTokenFilter(ts, minLen, maxLen, withOriginal);
    }

}

DomainNameTokenFilter.java

package com.clarityray.solr.analysis;

import java.util.Queue;
import java.util.LinkedList;
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class DomainNameTokenFilter extends TokenFilter {

    private CharTermAttribute charTermAttr;
    private PositionIncrementAttribute posIncAttr;
    private Queue<String> output;

    private int minLen;
    private int maxLen;
    private boolean withOriginal;

    public DomainNameTokenFilter(TokenStream ts, int minLen, int maxLen, boolean withOriginal) {
        super(ts);
        this.charTermAttr = addAttribute(CharTermAttribute.class);
        this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
        this.output = new LinkedList<String>();
        this.minLen = minLen;
        this.maxLen = maxLen;
        this.withOriginal = withOriginal;
    }

    private String join(String glue, String[] arr, int start, int end) {
        if (end < start)
            return "";
        StringBuilder sb = new StringBuilder();
        sb.append(arr[start]);
        for (int i = start+1; i <= end; ++i) {
            sb.append(glue);
            sb.append(arr[i]);
        }
        return sb.toString();
    }

    @Override
    public boolean incrementToken() throws IOException {

        // first -- output any ready tokens
        if (!output.isEmpty()) {
            charTermAttr.setEmpty();
            charTermAttr.append(output.poll());
            posIncAttr.setPositionIncrement(0);
            return true;
        }

        // no tokens ready in output buffer? get next token from input stream
        if (!input.incrementToken())
            return false;

        // get the text for the current token
        String s = charTermAttr.toString();

        // if the input does not look like a domain name, we leave it as is
        if (s.indexOf('.') == -1)
            return true;

        // create all sub-sequences
        String[] subParts = s.split("[.]");
        int actualMaxLen = Math.min(
            this.maxLen > 0 ? this.maxLen : subParts.length,
            subParts.length
        );
        for (int currentLen = this.minLen; currentLen <= actualMaxLen; ++currentLen)
            for (int i = 0; i + currentLen - 1 < subParts.length; ++i)
                output.add(join(".", subParts, i, i + currentLen - 1));

        // preserve original if so asked
        if (withOriginal && actualMaxLen < subParts.length)
            output.add(s);

        // output first of the generated tokens
        charTermAttr.setEmpty();
        charTermAttr.append(output.poll());
        posIncAttr.setPositionIncrement(1);
        return true;
    }

}
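For completeness, a fieldType wiring the filter in might look like the following. This is a sketch: the field name is hypothetical, and it assumes the compiled classes are on Solr's classpath. The minLen/maxLen/withOriginal attributes are the factory parameters shown above; StandardTokenizerFactory keeps dots that are not followed by whitespace inside a token, so "www.docs.corp.com" reaches the filter intact.

```xml
<fieldType name="text_domain" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="com.clarityray.solr.analysis.DomainNameTokenFilterFactory"
            minLen="1" maxLen="-1" withOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

At query time no expansion is needed: a query such as "docs.corp" stays a single token and matches the corresponding sub-sequence token produced at index time.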

Hope this helps someone.

Answer 1 (score: 0)

I would add a WordDelimiterFilterFactory with the preserveOriginal option, together with a WhitespaceTokenizerFactory.

From the WordDelimiterFilterFactory documentation:

preserveOriginal="1" causes the original token to be indexed without modifications (in addition to the tokens produced by the other options). The default is 0.

The WhitespaceTokenizerFactory will preserve the periods. Then, when you use WordDelimiterFilterFactory with the preserveOriginal option, it should index both the component parts and the original. I would also consider adding a LowerCaseFilterFactory, otherwise you may end up with mixed case in your index, which is probably not what you want.

Something like this, though you will need to play with it a bit:

<fieldType name="text_clr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This may not get you all the way there, but it should give you a good start. I would check this page for more details on the WordDelimiterFilterFactory:

https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters