我正在尝试对各种文本流进行“翻译”。更具体地说,我需要对输入流进行标记化,查找专用字典中的每个术语并输出令牌的相应“翻译”。但是,我还想保留输入中的所有原始空格,停用词等,以便输出的格式与输入相同,而不是最终成为翻译流。所以,如果我的输入是
Term1:Term2 Stopword! TERM3 Term4
然后我希望输出看起来像
Term1':Term2'停用词! TERM3' Term4'
(其中 Termi'是 Termi 的翻译)而不是简单
Term1'Term2' Term3' Term4'
目前我正在做以下事情:
PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
PatternAnalyzer.WHITESPACE_PATTERN,
false,
WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);
while (ts.incrementToken()) { // loop over tokens
String termIn = charTermAttribute.toString();
...
}
但是,这当然会丢失所有空格等。如何修改它以便能够将它们重新插入到输出中?非常感谢!
============更新!
我尝试将原始流分为“单词”和“非单词”。它似乎工作正常。但不确定这是否是最有效的方式:
public ArrayList splitToWords(String sIn)
{
if (sIn == null || sIn.length() == 0) {
return null;
}
char[] c = sIn.toCharArray();
ArrayList<Token> list = new ArrayList<Token>();
int tokenStart = 0;
boolean curIsLetter = Character.isLetter(c[tokenStart]);
for (int pos = tokenStart + 1; pos < c.length; pos++) {
boolean newIsLetter = Character.isLetter(c[pos]);
if (newIsLetter == curIsLetter) {
continue;
}
TokenType type = TokenType.NONWORD;
if (curIsLetter == true)
{
type = TokenType.WORD;
}
list.add(new Token(new String(c, tokenStart, pos - tokenStart),type));
tokenStart = pos;
curIsLetter = newIsLetter;
}
TokenType type = TokenType.NONWORD;
if (curIsLetter == true)
{
type = TokenType.WORD;
}
list.add(new Token(new String(c, tokenStart, c.length - tokenStart),type));
return list;
if (sIn == null || sIn.length() == 0) {
return null;
}
char[] c = sIn.toCharArray();
ArrayList<Token> list = new ArrayList<Token>();
int tokenStart = 0;
boolean curIsLetter = Character.isLetter(c[tokenStart]);
for (int pos = tokenStart + 1; pos < c.length; pos++) {
boolean newIsLetter = Character.isLetter(c[pos]);
if (newIsLetter == curIsLetter) {
continue;
}
TokenType type = TokenType.NONWORD;
if (curIsLetter == true)
{
type = TokenType.WORD;
}
list.add(new Token(new String(c, tokenStart, pos - tokenStart),type));
tokenStart = pos;
curIsLetter = newIsLetter;
}
TokenType type = TokenType.NONWORD;
if (curIsLetter == true)
{
type = TokenType.WORD;
}
list.add(new Token(new String(c, tokenStart, c.length - tokenStart),type));
return list;
答案 0 :(得分:0)
它确实没有丢失空格,你仍然有原文:)
所以我认为你应该使用OffsetAttribute,它包含每个术语的startOffset()和endOffset()到原始文本中。例如,这就是lucene用来突出显示原始文本搜索结果片段的内容。
我写了一个快速测试(使用EnglishAnalyzer)来演示: 输入是:
Just a test of some ideas. Let's see if it works.
输出结果为:
just a test of some idea. let see if it work.
// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
String input = "Just a test of some ideas. Let's see if it works.";
EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
StringBuilder output = new StringBuilder(input);
// in some cases, the analyzer will make terms longer or shorter.
// because of this we must track how much we have adjusted the text so far
// so that the offsets returned will still work for us via replace()
int delta = 0;
TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
ts.reset();
while (ts.incrementToken()) {
String term = termAtt.toString();
int start = offsetAtt.startOffset();
int end = offsetAtt.endOffset();
output.replace(delta + start, delta + end, term);
delta += (term.length() - (end - start));
}
ts.close();
System.out.println(output.toString());
}