Question

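I have a Set of Character delimiters (DELIMITERS), for example '.', ',' and so on. I want to use it to split a text into words and also get each word's position in the text. If you only need the words, String.split() works fine, and so does StringTokenizer. I wrote a simple method that solves this, but maybe there is a better way to achieve this result?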

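// needs: import java.util.ArrayList; import java.util.List;
// DELIMITERS is the Set<Character> described above
public List<String> extractWords(String text){
    List<String> words = new ArrayList<>();
    // filled in parallel with words; expose it (field or return value) if the caller needs it
    List<WordPos> positions = new ArrayList<>();
    int wordStart = -1; // -1 means "not inside a word"
    for(int i=0; i < text.length(); i++){
        if(DELIMITERS.contains(text.charAt(i))){
            if(wordStart >=0){ //word just ended
                String word = text.substring(wordStart, i);
                positions.add(new WordPos(wordStart, i));
                words.add(word);
            }
            wordStart = -1;
        }else{ //not delimiter == valid word
            if(wordStart < 0){ //word just started
                wordStart = i;
            }
        }
    }
    if(wordStart >= 0){ //text ended inside a word, so flush the last word
        positions.add(new WordPos(wordStart, text.length()));
        words.add(text.substring(wordStart));
    }
    return words;
}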

// static nested class holding a word's position (start inclusive, end exclusive)
public static class WordPos{
    int start;
    int end;
    public WordPos(int start, int end){
        this.start = start;
        this.end = end;
    }
}

Answer 1

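From an efficiency point of view, I don't think your solution is bad. From an aesthetic point of view (how the code looks), I would use Apache Commons StringUtils and do something like this (I haven't tried it):

Split the text into all of its tokens with splitPreserveAllTokens().
Iterate over the resulting array and store each token together with the position obtained by calling lastIndexOf.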
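
A minimal sketch of the split-then-locate idea, under a few assumptions: it swaps the Commons call for the JDK's String.split so that it runs without extra dependencies, and it looks tokens up with indexOf from a running cursor instead of lastIndexOf, because lastIndexOf would report the same offset for every occurrence of a repeated word. The class name SplitWithPositions, the Token holder and the regex argument are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitWithPositions {

    /** Hypothetical holder: a token plus its [start, end) offsets in the text. */
    public static final class Token {
        public final String text;
        public final int start;
        public final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits text on delimiterRegex and locates each token in the original text.
     * Assumption: delimiterRegex matches runs of delimiter characters, e.g. "[., ]+".
     */
    public static List<Token> locate(String text, String delimiterRegex) {
        List<Token> tokens = new ArrayList<>();
        int cursor = 0; // search resumes after the previous token
        for (String part : text.split(delimiterRegex)) {
            if (part.isEmpty()) {
                continue; // a leading delimiter yields one empty string
            }
            int start = text.indexOf(part, cursor);
            int end = start + part.length();
            tokens.add(new Token(part, start, end));
            cursor = end;
        }
        return tokens;
    }
}
```

For the input "to be, or not to be." with the regex "[., ]+", this reports the first "to" at 0 to 2 and the second at 14 to 16; a plain lastIndexOf("to") would give 14 for both.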

Answer 2

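List<String> words = new ArrayList<>();
List<WordPos> positions = new ArrayList<>();
int index = 0; // offset of the current token within text
// text is the input string; "., " lists the delimiter characters
StringTokenizer st = new StringTokenizer(text, "., ");

while (st.hasMoreTokens()) {
    String word = st.nextToken();
    words.add(word);
    positions.add(new WordPos(index, index + word.length()));
    // skip the word plus the single delimiter assumed to follow it
    index += word.length() + 1;
}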

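The code above assumes that there are never two consecutive delimiters. If that can happen, the approach stays the same, but the index has to be advanced past every extra delimiter as well.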
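
If consecutive delimiters can occur, one possible way to keep the offsets right is to ask StringTokenizer to return the delimiters as tokens too (its three-argument constructor with returnDelims set to true), so that every character of the input is accounted for. This is only a sketch: the class name TokenizerWithPositions and the Span holder are hypothetical, and it assumes every delimiter is a single char.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerWithPositions {

    /** Hypothetical holder: a word plus its [start, end) offsets in the text. */
    public static final class Span {
        public final String word;
        public final int start;
        public final int end;

        Span(String word, int start, int end) {
            this.word = word;
            this.start = start;
            this.end = end;
        }
    }

    public static List<Span> tokenize(String text, String delimiters) {
        List<Span> spans = new ArrayList<>();
        // returnDelims = true: each delimiter comes back as a one-character token
        StringTokenizer st = new StringTokenizer(text, delimiters, true);
        int index = 0; // offset of the current token within text
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            boolean isDelimiter =
                    token.length() == 1 && delimiters.indexOf(token.charAt(0)) >= 0;
            if (!isDelimiter) {
                spans.add(new Span(token, index, index + token.length()));
            }
            // delimiters and words both advance the offset by their own length
            index += token.length();
        }
        return spans;
    }
}
```

For "a,, b." with the delimiters "., ", this yields "a" at 0 to 1 and "b" at 4 to 5, even though three delimiters sit between the two words.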

Text tokenizer - extracting words and positions from text
