Generating N-grams while preserving whitespace in Apache Lucene

Date: 2019-11-25 17:27:21

Tags: indexing lucene tokenize n-gram

I am trying to generate N-grams for a given set of input texts using Apache Lucene 5.5.4. Below is the Java code I use to do this.

public static void main( String[] args )
    {
        Analyzer analyzer = createAnalyzer( 2 );
        List<String> nGrams = generateNgrams( analyzer, "blah1  blah2  blah3" );

        for ( String nGram : nGrams ) {
            System.out.println( nGram );
        }
    }


    public static Analyzer createAnalyzer( final int shingles )
    {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents( @NotNull String field )
            {
                final Tokenizer source = new WhitespaceTokenizer();
                final ShingleFilter shingleFilter = new ShingleFilter( new LowerCaseFilter( source ), shingles );
                shingleFilter.setOutputUnigrams( true );
                return new TokenStreamComponents( source, shingleFilter );
            }
        };
    }


    public static List<String> generateNgrams( Analyzer analyzer, String str )
    {
        List<String> result = new ArrayList<>();
        try {
            TokenStream stream = analyzer.tokenStream( null, new StringReader( str ) );
            stream.reset();
            while ( stream.incrementToken() ) {
                String nGram = stream.getAttribute( CharTermAttribute.class ).toString();
                result.add( nGram );
                LOG.debug( "Generated N-gram = {}", nGram );
            }
        } catch ( IOException e ) {
            LOG.error( "IO Exception occured! {}", e );
        }
        return result;
    }

For my input blah1  blah2  blah3, the output is as follows, which I am fine with:

blah1
blah1 blah2
blah2
blah2 blah3
blah3

However, when the input is Foo  bar  Foo2 (with two spaces between the words), my requirement is to generate the following output:

  1. Foo
  2. Foo  bar
  3. bar
  4. bar  Foo2
  5. Foo2

If you notice, I have to preserve the spaces between the words exactly as they appear in the input (Foo  bar with two spaces, not Foo bar with a single space).

Is there any tweak I can make so that Lucene handles this internally?

It could be something minor, such as adding a filter, but since I am new to Lucene I do not know where to start. Thanks in advance.

Edit 1

After debugging the Lucene source code, I found the place where the shingles are built up:

if (builtGramSize < gramNum) {
    if (builtGramSize > 0) {
        gramBuilder.append(tokenSeparator);
    }
    gramBuilder.append(nextToken.termAtt.buffer(), 0,
                       nextToken.termAtt.length());
    ++builtGramSize;
}

gramBuilder.append(tokenSeparator) is where the non-token (the separator) is appended to the output N-gram. The code above is from the incrementToken() method of ShingleFilter.
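For reference, ShingleFilter does expose a setter for that separator, but it is a single fixed string applied to every shingle, so it cannot reproduce the variable whitespace of the original input. A minimal sketch of that (insufficient) tweak, reusing the source tokenizer and shingles size from the createAnalyzer above:

// ShingleFilter only supports one fixed separator string for all shingles,
// so this changes what is inserted between tokens but cannot preserve the
// original, variable whitespace between words.
ShingleFilter shingleFilter = new ShingleFilter( new LowerCaseFilter( source ), shingles );
shingleFilter.setOutputUnigrams( true );
shingleFilter.setTokenSeparator( " " ); // default is a single space; any fixed string works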

1 Answer:

Answer 0 (score: 0)

I had to write a custom tokenizer and a trim filter to achieve this.

1) I created an abstract class DelimiterPreservingCharTokenizer by extending org.apache.lucene.analysis.Tokenizer and gave my own implementation of the incrementToken method. I would have extended org.apache.lucene.analysis.util.CharTokenizer instead, but its incrementToken method is final and cannot be overridden. DelimiterPreservingCharTokenizer is shown below.

package spellcheck.lucene.tokenizers;

import java.io.IOException;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.analysis.util.CharacterUtils;
import org.apache.lucene.analysis.util.CharacterUtils.CharacterBuffer;
import org.apache.lucene.util.AttributeFactory;


/**
 * 
 * @author Arun Gowda.
 * This class is exactly same as {@link CharTokenizer}. Except that, the stream will have leading delimiters. This is to support N-gram vicinity matches.
 * 
 * We are creating a new class instead of extending CharTokenizer because,  incrementToken method is final and we can not override it.
 *
 */
public abstract class DelimiterPreservingCharTokenizer extends Tokenizer
{


    /**
     * Creates a new {@link DelimiterPreservingCharTokenizer} instance
     */
    public DelimiterPreservingCharTokenizer()
    {}


    /**
     * Creates a new {@link DelimiterPreservingCharTokenizer} instance
     * 
     * @param factory
     *          the attribute factory to use for this {@link Tokenizer}
     */
    public DelimiterPreservingCharTokenizer( AttributeFactory factory )
    {
        super( factory );
    }

    private int offset = 0, bufferIndex = 0, dataLen = 0, finalOffset = 0;
    private static final int MAX_WORD_LEN = 255;
    private static final int IO_BUFFER_SIZE = 4096;

    private final CharTermAttribute termAtt = addAttribute( CharTermAttribute.class );
    private final OffsetAttribute offsetAtt = addAttribute( OffsetAttribute.class );

    private final CharacterUtils charUtils = CharacterUtils.getInstance();
    private final CharacterBuffer ioBuffer = CharacterUtils.newCharacterBuffer( IO_BUFFER_SIZE );


    /**
     * Returns true iff a codepoint should be included in a token. This tokenizer
     * generates as tokens adjacent sequences of codepoints which satisfy this
     * predicate. Codepoints for which this is false are used to define token
     * boundaries and are not included in tokens.
     */
    protected abstract boolean isTokenChar( int c );


    /**
     * Called on each token character to normalize it before it is added to the
     * token. The default implementation does nothing. Subclasses may use this to,
     * e.g., lowercase tokens.
     */
    protected int normalize( int c )
    {
        return c;
    }


    @Override
    public final boolean incrementToken() throws IOException
    {
        clearAttributes();
        int length = 0;
        int start = -1; // this variable is always initialized
        int end = -1;
        char[] buffer = termAtt.buffer();
        while ( true ) {
            if ( bufferIndex >= dataLen ) {
                offset += dataLen;
                charUtils.fill( ioBuffer, input ); // read supplementary char aware with CharacterUtils
                if ( ioBuffer.getLength() == 0 ) {
                    dataLen = 0; // so next offset += dataLen won't decrement offset
                    if ( length > 0 ) {
                        break;
                    } else {
                        finalOffset = correctOffset( offset );
                        return false;
                    }
                }
                dataLen = ioBuffer.getLength();
                bufferIndex = 0;
            }
            // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
            final int c = charUtils.codePointAt( ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength() );
            final int charCount = Character.charCount( c );
            bufferIndex += charCount;

            if ( isTokenChar( c ) ) { // if it's a token char
                if ( length == 0 ) { // start of token
                    assert start == -1;
                    start = offset + bufferIndex - charCount;
                    end = start;
                } else if ( length >= buffer.length - 1 ) { // check if a supplementary could run out of bounds
                    buffer = termAtt.resizeBuffer( 2 + length ); // make sure a supplementary fits in the buffer
                }
                end += charCount;
                length += Character.toChars( normalize( c ), buffer, length ); // buffer it, normalized
                if ( length >= MAX_WORD_LEN ) // buffer overflow! make sure to check for >= surrogate pair could break == test
                    break;
            } else if ( length > 0 ) // at non-Letter w/ chars
                break; // return 'em
        }

        if ( length > 0 && bufferIndex < ioBuffer.getLength() ) {//If at least one token is found,

            //THIS IS THE PART WHICH IS DIFFERENT FROM LUCENE's CHARTOKENIZER

            // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
            int c = charUtils.codePointAt( ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength() );
            int charCount = Character.charCount( c );
            bufferIndex += charCount;

            while ( !isTokenChar( c ) && bufferIndex < ioBuffer.getLength() ) {// As long as we find delimiter(not token char), keep appending it to output stream.

                if ( length >= buffer.length - 1 ) { // check if a supplementary could run out of bounds
                    buffer = termAtt.resizeBuffer( 2 + length ); // make sure a supplementary fits in the buffer
                }

                end += charCount;

                length += Character.toChars( normalize( c ), buffer, length ); // buffer it, normalized

                if ( length >= MAX_WORD_LEN ) {// buffer overflow! make sure to check for >= surrogate pair could break == test
                    break;
                }

                c = charUtils.codePointAt( ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength() );
                charCount = Character.charCount( c );
                bufferIndex += charCount;
            }
            //ShingleFilter will add a delimiter. Hence, the last iteration is skipped.
            //That is, for input `abc   def   ghi`, this tokenizer will return `abc  `(2 spaces only). Then, Shingle filter will by default add another delimiter making it `abc   `(3 spaces as it is in the input).
            //If there are N delimiters, this token will at max return N-1 delimiters

            bufferIndex -= charCount;
        }
        termAtt.setLength( length );
        assert start != -1;
        offsetAtt.setOffset( correctOffset( start ), finalOffset = correctOffset( end ) );
        return true;
    }


    @Override
    public final void end() throws IOException
    {
        super.end();
        // set final offset
        offsetAtt.setOffset( finalOffset, finalOffset );
    }


    @Override
    public void reset() throws IOException
    {
        super.reset();
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
        finalOffset = 0;
        ioBuffer.reset(); // make sure to reset the IO buffer!!
    }
}

2) The concrete class WhiteSpacePreservingTokenizer extends the abstract class above and specifies the delimiter (whitespace):

package spellcheck.lucene.tokenizers;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.AttributeFactory;

/**
 * 
 * @author Arun Gowda
 *
 * This class is exactly same as {@link WhitespaceTokenizer} Only difference is, it extends DelimiterPreservingCharTokenizer instead of CharTokenizer
 */
public class WhiteSpacePreservingTokenizer extends DelimiterPreservingCharTokenizer
{

    /**
     * Construct a new WhitespaceTokenizer.
     */
    public WhiteSpacePreservingTokenizer()
    {}


    /**
     * Construct a new WhitespaceTokenizer using a given
     * {@link org.apache.lucene.util.AttributeFactory}.
     *
     * @param factory
     *          the attribute factory to use for this {@link Tokenizer}
     */
    public WhiteSpacePreservingTokenizer( AttributeFactory factory )
    {
        super( factory );
    }


    /** Collects only characters which do not satisfy
     * {@link Character#isWhitespace(int)}.*/
    @Override
    protected boolean isTokenChar( int c )
    {
        return !Character.isWhitespace( c );
    }
}

3) The tokenizer above produces tokens with trailing spaces (e.g. blah____, where the underscores stand for spaces), so we need a filter that trims those spaces. Hence the following DelimiterTrimFilter. (We could also simply trim with Java's String.trim(), but that would be inefficient because it creates a new String for every token; a naive sketch of that alternative is shown after the filter code below.)

package spellcheck.lucene.filters;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;


public class DelimiterTrimFilter extends TokenFilter
{


    private final CharTermAttribute termAtt = addAttribute( CharTermAttribute.class );

    private char delimiter;


    /**
     * Create a new {@link DelimiterTrimFilter}.
     * @param in            the stream to consume
     * @param delimiterToTrim delimiter that should be trimmed
     */
    public DelimiterTrimFilter( TokenStream in, char delimiterToTrim )
    {
        super( in );
        this.delimiter = delimiterToTrim;
    }


    @Override
    public boolean incrementToken() throws IOException
    {
        if ( !input.incrementToken() )
            return false;

        char[] termBuffer = termAtt.buffer();
        int len = termAtt.length();

        if ( len == 0 ) {
            return true;
        }
        int start = 0;
        int end = 0;

        // eat the first characters
        for ( start = 0; start < len && termBuffer[start] == delimiter; start++ ) {
        }
        // eat the end characters
        for ( end = len; end >= start && termBuffer[end - 1] == delimiter; end-- ) {
        }
        if ( start > 0 || end < len ) {
            if ( start < end ) {
                termAtt.copyBuffer( termBuffer, start, ( end - start ) );
            } else {
                termAtt.setEmpty();
            }
        }
        return true;
    }


}
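For contrast, the String-based trimming mentioned above would look roughly like the following. This is only a hypothetical sketch of the less efficient alternative (it assumes the same TokenFilter skeleton and termAtt field as DelimiterTrimFilter) and is not the approach used here, because it allocates a new String per token:

    @Override
    public boolean incrementToken() throws IOException
    {
        if ( !input.incrementToken() )
            return false;

        // Copies the term into a fresh String just to strip the whitespace,
        // which is exactly the per-token allocation the buffer-based filter avoids.
        String trimmed = termAtt.toString().trim();
        termAtt.setEmpty().append( trimmed );
        return true;
    }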

4) My createAnalyzer now looks like this:

public static Analyzer createAnalyzer( final int shingles )
    {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents( @NotNull String field )
            {
                final Tokenizer source = new WhiteSpacePreservingTokenizer();
                TokenStream filter = new ShingleFilter( new LowerCaseFilter( source ), shingles );
                filter = new DelimiterTrimFilter( filter, ' ' );
                return new TokenStreamComponents( source, filter );
            }
        };
    }

The rest of the code remains the same.
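To confirm the behavior, the main and generateNgrams methods from the question can be reused unchanged with the new analyzer. A minimal usage sketch follows; the expected output assumes two spaces between the input words, and the terms come out lowercased because of the LowerCaseFilter in the analyzer:

Analyzer analyzer = createAnalyzer( 2 );
List<String> nGrams = generateNgrams( analyzer, "Foo  bar  Foo2" );
for ( String nGram : nGrams ) {
    System.out.println( "[" + nGram + "]" ); // brackets make the preserved spaces visible
}
// Expected output (two spaces preserved inside the bigrams):
// [foo]
// [foo  bar]
// [bar]
// [bar  foo2]
// [foo2]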