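Split a sentence into words and punctuation marks

Asked: 2014-01-05 11:51:29

Tags: java string split

I need to parse the class Sentence into words and punctuation marks (whitespace counts as punctuation), and then add all of them to a general ArrayList<Sentence>.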

An example sentence:

  A man, a plan, a canal - Panama!

  A => word
  whitespace => punctuation
  man => word
  , + space => punctuation
  a => word
  [...]

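I tried reading the whole sentence one character at a time, collecting consecutive characters of the same kind, and creating a new Word or a new Punctuation from each collected run.

Here is my code: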

public class Sentence {

    private String sentence;
    private LinkedList<SentenceElement> elements;

    /**
     * Constructs a sentence.
     * @param aText a string containing all characters of the sentence
     */
    public Sentence(String aText) {
        sentence = aText.trim();
        splitSentence();
    }

    public String getSentence() {
        return sentence;
    }

    public LinkedList<SentenceElement> getElements() {
        return elements;
    }

    /**
     * Splits the sentence into words and punctuation marks.
     */
    private void splitSentence() {
        if (sentence == "" || sentence == null || sentence == "\n") {
            return;
        }

        StringBuilder builder = new StringBuilder();

        int j = 0;
        boolean mark = false;
        while (j < sentence.length()) {
            //char current = sentence.charAt(j);

            while (Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Punctuation(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            } 
            mark = true;

            while (!Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Word(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            }
            mark = true;
        }
    }
}

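But the logic in splitSentence() does not work correctly, and I cannot find a proper solution.

What I want to implement: read the first character and add it to builder; as long as the next character is of the same type (letter or punctuation), keep adding to builder; when the next character is of a different type than the contents of builder, create a new Word or Punctuation and reset the builder.

Then apply the same logic again.

How can I implement this checking logic correctly?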
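The run-collecting logic described above can be sketched as a single loop that flushes the builder whenever the character type changes. This is only a minimal sketch under stated assumptions: SentenceSplitter and Element are hypothetical names standing in for the question's Sentence and SentenceElement classes, and "word character" is taken to mean Character.isLetter, as in the question's code. Unlike the nested while loops in the question, it never indexes past the end of the string, it returns an initialized list, and it flushes the last run after the loop ends.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceSplitter {

    /** Hypothetical stand-in for the question's Word / Punctuation classes. */
    static final class Element {
        final String text;
        final boolean word; // true = word, false = punctuation (including whitespace)

        Element(String text, boolean word) {
            this.text = text;
            this.word = word;
        }

        @Override
        public String toString() {
            return "\"" + text + "\" (" + (word ? "word" : "punctuation") + ")";
        }
    }

    /** Splits a sentence into alternating runs of letters and non-letters. */
    static List<Element> split(String sentence) {
        List<Element> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }

        StringBuilder run = new StringBuilder();
        // The type of the current run is decided by its first character.
        boolean runIsWord = Character.isLetter(sentence.charAt(0));

        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            boolean isWord = Character.isLetter(c);
            if (isWord != runIsWord) {
                // The type changed: emit the finished run and start a new one.
                elements.add(new Element(run.toString(), runIsWord));
                run.setLength(0);
                runIsWord = isWord;
            }
            run.append(c);
        }
        // Emit the last run, which the loop itself never flushes.
        elements.add(new Element(run.toString(), runIsWord));
        return elements;
    }

    public static void main(String[] args) {
        for (Element e : split("A man, a plan, a canal - Panama!")) {
            System.out.println(e);
        }
    }
}
```

Tracking the type of the current run in one boolean avoids the two mirrored inner loops, so there is only one place where an element is created mid-loop.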

1 Answer:

Answer 0 (score: 3)

Split the string on word boundaries (except the one at the very start):

String[] parts = sentence.split("(?<!^)\\b");

The array will contain alternating word/punctuation/word/punctuation/word and so on.

Here is some test code:

String sentence = "A man, a plan, a canal — Panama!";
String[] parts = sentence.split("(?<!^)\\b");
for (String part : parts)
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")");

Output:

"A" (word)
" " (punctuation)
"man" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"plan" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"canal" (word)
" — " (punctuation)
"Panama" (word)
"!" (punctuation)
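
To get from the array of parts to the list of elements the question asks for, each part can be wrapped in a Word or a Punctuation. The following is a minimal sketch, not the asker's actual classes: BoundarySplitDemo is a hypothetical name, and the SentenceElement, Word and Punctuation bodies are assumed stand-ins, since the question does not show them. Note that \w also matches digits and underscores, so this classification is not identical to the Character.isLetter check used in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundarySplitDemo {

    // Hypothetical stand-ins for the classes named in the question.
    static abstract class SentenceElement {
        final String text;

        SentenceElement(String text) {
            this.text = text;
        }
    }

    static final class Word extends SentenceElement {
        Word(String text) {
            super(text);
        }
    }

    static final class Punctuation extends SentenceElement {
        Punctuation(String text) {
            super(text);
        }
    }

    /** Splits on word boundaries and wraps each part in a Word or Punctuation. */
    static List<SentenceElement> toElements(String sentence) {
        List<SentenceElement> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }
        // (?<!^) skips the boundary at the start of the string,
        // so the first part is never an empty string.
        for (String part : sentence.split("(?<!^)\\b")) {
            if (part.isEmpty()) {
                continue; // defensive: ignore empty parts
            }
            SentenceElement element = part.matches("\\w+")
                    ? new Word(part)
                    : new Punctuation(part);
            elements.add(element);
        }
        return elements;
    }

    public static void main(String[] args) {
        for (SentenceElement e : toElements("A man, a plan, a canal - Panama!")) {
            System.out.println(e.getClass().getSimpleName() + ": \"" + e.text + "\"");
        }
    }
}
```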