I am trying to use Apache Lucene 5.5.4 to generate N-grams for a given set of input texts. Below is the Java code I use to do this.
public static void main( String[] args )
{
Analyzer analyzer = createAnalyzer( 2 );
List<String> nGrams = generateNgrams( analyzer, "blah1 blah2 blah3" );
for ( String nGram : nGrams ) {
System.out.println( nGram );
}
}
public static Analyzer createAnalyzer( final int shingles )
{
return new Analyzer() {
@Override
protected TokenStreamComponents createComponents( @NotNull String field )
{
final Tokenizer source = new WhitespaceTokenizer();
final ShingleFilter shingleFilter = new ShingleFilter( new LowerCaseFilter( source ), shingles );
shingleFilter.setOutputUnigrams( true );
return new TokenStreamComponents( source, shingleFilter );
}
};
}
public static List<String> generateNgrams( Analyzer analyzer, String str )
{
List<String> result = new ArrayList<>();
try {
TokenStream stream = analyzer.tokenStream( null, new StringReader( str ) );
stream.reset();
while ( stream.incrementToken() ) {
String nGram = stream.getAttribute( CharTermAttribute.class ).toString();
result.add( nGram );
LOG.debug( "Generated N-gram = {}", nGram );
}
} catch ( IOException e ) {
LOG.error( "IO Exception occured! {}", e );
}
return result;
}
For my input blah1 blah2 blah3, the output is as follows, which is fine for me:
blah1
blah1 blah2
blah2
blah2 blah3
blah3
However, when the input is Foo  bar Foo2, where Foo and bar are separated by more than one space, my requirement is to generate the following output:
Foo
Foo bar
bar
bar Foo2
Foo2
If you notice, I have to preserve the whitespace between the two words exactly as it appears in the input (i.e. Foo  bar with the original run of spaces, not Foo bar collapsed to a single space).
Is there any tweak I can make so that Lucene handles this internally? Perhaps something small, like adding a filter; since I am new to Lucene, I don't know where to start. Thanks in advance.
Edit 1
After debugging the Lucene source code, I found where the shingles are assembled:
if (builtGramSize < gramNum) {
if (builtGramSize > 0) {
gramBuilder.append(tokenSeparator);
}
gramBuilder.append(nextToken.termAtt.buffer(), 0,
nextToken.termAtt.length());
++builtGramSize;
}
gramBuilder.append(tokenSeparator) is where the separator (a non-token character) is appended to the output N-gram. The code above is from the incrementToken() method of ShingleFilter.
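Incidentally, ShingleFilter does expose a setTokenSeparator(String) hook, but it takes one fixed string that is used between every pair of tokens, so it cannot reproduce the varying whitespace of the original input. A minimal sketch of that tweak (for reference only, not a solution to my problem):

ShingleFilter shingleFilter = new ShingleFilter( new LowerCaseFilter( source ), shingles );
shingleFilter.setOutputUnigrams( true );
// One fixed separator for all shingles, regardless of the spacing in the input.
shingleFilter.setTokenSeparator( "  " );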
Answer (score: 0)
I had to write a custom tokenizer and a trim filter to achieve this.
1) I created an abstract class DelimiterPreservingCharTokenizer by extending org.apache.lucene.analysis.Tokenizer; my implementation of incrementToken is given below. I would have extended org.apache.lucene.analysis.util.CharTokenizer instead, but its incrementToken method is final and cannot be overridden. Here is DelimiterPreservingCharTokenizer:
package spellcheck.lucene.tokenizers;
import java.io.IOException;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.analysis.util.CharacterUtils;
import org.apache.lucene.analysis.util.CharacterUtils.CharacterBuffer;
import org.apache.lucene.util.AttributeFactory;
/**
*
* @author Arun Gowda.
* This class is exactly the same as {@link CharTokenizer}, except that each emitted token also carries the delimiters that follow it. This is to support N-gram vicinity matches.
*
* We are creating a new class instead of extending CharTokenizer because its incrementToken method is final and we cannot override it.
*
*/
public abstract class DelimiterPreservingCharTokenizer extends Tokenizer
{
/**
* Creates a new {@link DelimiterPreservingCharTokenizer} instance
*/
public DelimiterPreservingCharTokenizer()
{}
/**
* Creates a new {@link DelimiterPreservingCharTokenizer} instance
*
* @param factory
* the attribute factory to use for this {@link Tokenizer}
*/
public DelimiterPreservingCharTokenizer( AttributeFactory factory )
{
super( factory );
}
private int offset = 0, bufferIndex = 0, dataLen = 0, finalOffset = 0;
private static final int MAX_WORD_LEN = 255;
private static final int IO_BUFFER_SIZE = 4096;
private final CharTermAttribute termAtt = addAttribute( CharTermAttribute.class );
private final OffsetAttribute offsetAtt = addAttribute( OffsetAttribute.class );
private final CharacterUtils charUtils = CharacterUtils.getInstance();
private final CharacterBuffer ioBuffer = CharacterUtils.newCharacterBuffer( IO_BUFFER_SIZE );
/**
* Returns true iff a codepoint should be included in a token. This tokenizer
* generates as tokens adjacent sequences of codepoints which satisfy this
* predicate. Codepoints for which this is false are used to define token
* boundaries and are not included in tokens.
*/
protected abstract boolean isTokenChar( int c );
/**
* Called on each token character to normalize it before it is added to the
* token. The default implementation does nothing. Subclasses may use this to,
* e.g., lowercase tokens.
*/
protected int normalize( int c )
{
return c;
}
@Override
public final boolean incrementToken() throws IOException
{
clearAttributes();
int length = 0;
int start = -1; // this variable is always initialized
int end = -1;
char[] buffer = termAtt.buffer();
while ( true ) {
if ( bufferIndex >= dataLen ) {
offset += dataLen;
charUtils.fill( ioBuffer, input ); // read supplementary char aware with CharacterUtils
if ( ioBuffer.getLength() == 0 ) {
dataLen = 0; // so next offset += dataLen won't decrement offset
if ( length > 0 ) {
break;
} else {
finalOffset = correctOffset( offset );
return false;
}
}
dataLen = ioBuffer.getLength();
bufferIndex = 0;
}
// use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
final int c = charUtils.codePointAt( ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength() );
final int charCount = Character.charCount( c );
bufferIndex += charCount;
if ( isTokenChar( c ) ) { // if it's a token char
if ( length == 0 ) { // start of token
assert start == -1;
start = offset + bufferIndex - charCount;
end = start;
} else if ( length >= buffer.length - 1 ) { // check if a supplementary could run out of bounds
buffer = termAtt.resizeBuffer( 2 + length ); // make sure a supplementary fits in the buffer
}
end += charCount;
length += Character.toChars( normalize( c ), buffer, length ); // buffer it, normalized
if ( length >= MAX_WORD_LEN ) // buffer overflow! make sure to check for >= surrogate pair could break == test
break;
} else if ( length > 0 ) // at non-Letter w/ chars
break; // return 'em
}
if ( length > 0 && bufferIndex < ioBuffer.getLength() ) {//If at least one token is found,
//THIS IS THE PART WHICH IS DIFFERENT FROM LUCENE's CHARTOKENIZER
// use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
int c = charUtils.codePointAt( ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength() );
int charCount = Character.charCount( c );
bufferIndex += charCount;
while ( !isTokenChar( c ) && bufferIndex < ioBuffer.getLength() ) {// As long as we find delimiter(not token char), keep appending it to output stream.
if ( length >= buffer.length - 1 ) { // check if a supplementary could run out of bounds
buffer = termAtt.resizeBuffer( 2 + length ); // make sure a supplementary fits in the buffer
}
end += charCount;
length += Character.toChars( normalize( c ), buffer, length ); // buffer it, normalized
if ( length >= MAX_WORD_LEN ) {// buffer overflow! make sure to check for >= surrogate pair could break == test
break;
}
c = charUtils.codePointAt( ioBuffer.getBuffer(), bufferIndex, ioBuffer.getLength() );
charCount = Character.charCount( c );
bufferIndex += charCount;
}
//ShingleFilter will add a delimiter. Hence, the last iteration is skipped.
//That is, for input `abc def ghi`, this tokenizer will return `abc `(2 spaces only). Then, Shingle filter will by default add another delimiter making it `abc `(3 spaces as it is in the input).
//If there are N delimiters, this token will at max return N-1 delimiters
bufferIndex -= charCount;
}
termAtt.setLength( length );
assert start != -1;
offsetAtt.setOffset( correctOffset( start ), finalOffset = correctOffset( end ) );
return true;
}
@Override
public final void end() throws IOException
{
super.end();
// set final offset
offsetAtt.setOffset( finalOffset, finalOffset );
}
@Override
public void reset() throws IOException
{
super.reset();
bufferIndex = 0;
offset = 0;
dataLen = 0;
finalOffset = 0;
ioBuffer.reset(); // make sure to reset the IO buffer!!
}
}
2) The concrete class WhiteSpacePreservingTokenizer extends the abstract class above and defines which characters count as delimiters:
package spellcheck.lucene.tokenizers;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.AttributeFactory;
/**
*
* @author Arun Gowda
*
* This class is exactly the same as {@link WhitespaceTokenizer}; the only difference is that it extends DelimiterPreservingCharTokenizer instead of CharTokenizer.
*/
public class WhiteSpacePreservingTokenizer extends DelimiterPreservingCharTokenizer
{
/**
* Construct a new WhitespaceTokenizer.
*/
public WhiteSpacePreservingTokenizer()
{}
/**
* Construct a new WhitespaceTokenizer using a given
* {@link org.apache.lucene.util.AttributeFactory}.
*
* @param factory
* the attribute factory to use for this {@link Tokenizer}
*/
public WhiteSpacePreservingTokenizer( AttributeFactory factory )
{
super( factory );
}
/** Collects only characters which do not satisfy
* {@link Character#isWhitespace(int)}.*/
@Override
protected boolean isTokenChar( int c )
{
return !Character.isWhitespace( c );
}
}
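To sanity-check the tokenizer on its own, a small throwaway driver like the one below can be used (the class name TokenizerDemo and the sample input are just for illustration). It feeds a string with two spaces between abc and def and prints each emitted term in quotes so the preserved trailing delimiter is visible; per the comment in incrementToken, the first term should come out as 'abc ' with a single trailing space, and ShingleFilter later contributes the remaining separator.

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import spellcheck.lucene.tokenizers.WhiteSpacePreservingTokenizer;

public class TokenizerDemo
{
    public static void main( String[] args ) throws Exception
    {
        Tokenizer tokenizer = new WhiteSpacePreservingTokenizer();
        tokenizer.setReader( new StringReader( "abc  def ghi" ) ); // two spaces after "abc"
        CharTermAttribute termAtt = tokenizer.getAttribute( CharTermAttribute.class );
        tokenizer.reset();
        while ( tokenizer.incrementToken() ) {
            // Quotes make the trailing delimiter visible, e.g. 'abc ' for the first term.
            System.out.println( "'" + termAtt.toString() + "'" );
        }
        tokenizer.end();
        tokenizer.close();
    }
}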
3) The tokenizer above leaves trailing whitespace on the tokens (for example blah____), so we need a filter to trim those delimiters; that is what DelimiterTrimFilter below does. (We could also trim with Java's String.trim(), but that would be wasteful, since it creates new String objects.)
package spellcheck.lucene.filters;
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
public class DelimiterTrimFilter extends TokenFilter
{
private final CharTermAttribute termAtt = addAttribute( CharTermAttribute.class );
private char delimiter;
/**
* Create a new {@link DelimiterTrimFilter}.
* @param in the stream to consume
* @param delimiterToTrim delimiter that should be trimmed
*/
public DelimiterTrimFilter( TokenStream in, char delimiterToTrim )
{
super( in );
this.delimiter = delimiterToTrim;
}
@Override
public boolean incrementToken() throws IOException
{
if ( !input.incrementToken() )
return false;
char[] termBuffer = termAtt.buffer();
int len = termAtt.length();
if ( len == 0 ) {
return true;
}
int start = 0;
int end = 0;
// eat the first characters
for ( start = 0; start < len && termBuffer[start] == delimiter; start++ ) {
}
// eat the end characters
for ( end = len; end >= start && termBuffer[end - 1] == delimiter; end-- ) {
}
if ( start > 0 || end < len ) {
if ( start < end ) {
termAtt.copyBuffer( termBuffer, start, ( end - start ) );
} else {
termAtt.setEmpty();
}
}
return true;
}
}
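To see the trimming in isolation, here is a rough sketch (the class name TrimFilterDemo and the sample text are made up; KeywordTokenizer emits the whole input as a single token, so the surrounding spaces survive until the filter removes them):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import spellcheck.lucene.filters.DelimiterTrimFilter;

public class TrimFilterDemo
{
    public static void main( String[] args ) throws Exception
    {
        Tokenizer source = new KeywordTokenizer();
        source.setReader( new StringReader( "  blah1 blah2  " ) );
        TokenStream stream = new DelimiterTrimFilter( source, ' ' );
        CharTermAttribute termAtt = stream.getAttribute( CharTermAttribute.class );
        stream.reset();
        while ( stream.incrementToken() ) {
            System.out.println( "'" + termAtt.toString() + "'" ); // should print 'blah1 blah2'
        }
        stream.end();
        stream.close();
    }
}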
4) My createAnalyzer now looks like this:
public static Analyzer createAnalyzer( final int shingles )
{
return new Analyzer() {
@Override
protected TokenStreamComponents createComponents( @NotNull String field )
{
final Tokenizer source = new WhiteSpacePreservingTokenizer();
TokenStream filter = new ShingleFilter( new LowerCaseFilter( source ), shingles );
filter = new DelimiterTrimFilter( filter, ' ' );
return new TokenStreamComponents( source, filter );
}
};
}
The rest of the code remains unchanged.
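Putting it together, the generateNgrams helper from the question can be reused unchanged with the new analyzer. A quick sketch of the expected behaviour (the sample input is illustrative; note the two spaces between foo and bar, and that LowerCaseFilter is still in the chain):

Analyzer analyzer = createAnalyzer( 2 );
// Same helper as in the question; only the analyzer changed.
List<String> nGrams = generateNgrams( analyzer, "foo  bar foo2" );
for ( String nGram : nGrams ) {
    // Should include 'foo  bar' and 'bar foo2' with the input whitespace preserved.
    System.out.println( "'" + nGram + "'" );
}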