Question

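I am writing a spell checker that takes a text file as input and outputs a file with the spelling corrected.

The program should preserve formatting and punctuation.

I want to split the input text into a list of string tokens so that each token is one or more word, punctuation, whitespace, or digit characters.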

For example:

Input:

words.txt:

asdf don't ]'.'..;'' as12....asdf.
asdf

Output list:

["asdf", " ", "don't", " ", "]'.'..;''", " ", "as", "12", "....", "asdf", ".", "\n", "asdf"]

Words like won't and i'll should be treated as a single token.

Having the data in this format would let me process the tokens like this:

String output = "";

for(String token : tokens) {
    if(isWord(token)) {
        if(!inDictionary(token)) {
            token = correctSpelling(token);
        }
    }
    output += token;
}
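
For reference, a runnable version of that loop could look like the sketch below. It swaps the repeated string concatenation for a StringBuilder. The dictionary contents and the bodies of isWord, inDictionary and correctSpelling are placeholders invented for illustration, not part of any real spell-checking library.

```java
import java.util.List;
import java.util.Set;

class TokenLoopSketch {
    // Hypothetical stand-in dictionary; a real checker would load a word list.
    private static final Set<String> DICTIONARY = Set.of("hello", "world");

    // Assumption: a token counts as a word if its first character is a letter.
    static boolean isWord(String token) {
        return !token.isEmpty() && Character.isLetter(token.charAt(0));
    }

    static boolean inDictionary(String token) {
        return DICTIONARY.contains(token.toLowerCase());
    }

    // Placeholder correction: knows one misspelling, otherwise returns the token unchanged.
    static String correctSpelling(String token) {
        return token.equals("wrold") ? "world" : token;
    }

    // Corrects word tokens and copies every other token through untouched.
    static String process(List<String> tokens) {
        StringBuilder output = new StringBuilder();
        for (String token : tokens) {
            if (isWord(token) && !inDictionary(token)) {
                token = correctSpelling(token);
            }
            output.append(token);
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(process(List.of("hello", " ", "wrold", "!")));
    }
}
```

Because whitespace and punctuation tokens are appended unchanged, the output keeps the original formatting.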

So my main question is: how can I split a string of text into a list of substrings as described above? Thank you.

Answer 1

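The main difficulty here is finding a regular expression that matches what you consider to be a "word". For my example, I treat ' as part of a word if the word starts with a letter or if the character following the ' is a letter: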

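// Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.
// inDictionary and correctSpelling are the helpers from the question.
public static void main(String[] args) {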
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";

        //The pattern: 
        Pattern p = Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");

        Matcher m = p.matcher(in);
        //If you want to collect the words
        List<String> words = new ArrayList<String>();

        StringBuilder result = new StringBuilder();

        //Now find something from the start
        int pos = 0; 
        while(m.find(pos)) {
            //Add everything from starting position to beginning of word
            result.append(in.substring(pos, m.start()));

            //Handle dictionary logic
            String token = m.group();
            words.add(token); //not used actually
            if(!inDictionary(token)) {
                token = correctSpelling(token);
            }
            //Add to result
            result.append(token);
            //Repeat from end position
            pos = m.end();
        }
        //Append remainder of input
        result.append(in.substring(pos));

        System.out.println("Result: " + result.toString());
    }
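
To check what that pattern actually treats as a word, here is a minimal self-contained sketch that only collects the matches. The class and method names are made up for illustration; the pattern is the one from the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPatternSketch {
    // A letter followed by letters or apostrophes, or an apostrophe followed by letters.
    private static final Pattern WORD =
            Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");

    // Returns every substring of the input that the pattern matches, in order.
    static List<String> findWords(String in) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(in);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [asdf, don't, as, asdf, asdf]
        System.out.println(findWords("asdf don't ]'.'..;'' as12....asdf.\nasdf"));
    }
}
```

Note that a trailing apostrophe is also absorbed into the word (dogs' matches as a whole), and that \p{Alpha} only covers ASCII letters unless the pattern is compiled with Pattern.UNICODE_CHARACTER_CLASS.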

Answer 2

Since I like solving puzzles, I tried the following, and I think it works well:

public class MyTokenizer {
    private final String str;
    private int pos = 0;

    public MyTokenizer(String str) {
        this.str = str;
    }

    public boolean hasNext() {
        return pos < str.length();
    }

    public String next() {
        int type = getType(str.charAt(pos));
        StringBuilder sb = new StringBuilder();
        while(hasNext() && (str.charAt(pos) == '\'' || type == getType(str.charAt(pos)))) {
            sb.append(str.charAt(pos));
            pos++;
        }
        return sb.toString();
    }

    private int getType(char c) {
        String sc = Character.toString(c);
        if (sc.matches("\\d")) {
            return 0;
        }
        else if (sc.matches("\\w")) {
            return 1;
        }
        else if (sc.matches("\\s")) {
            return 2;
        }
        else if (sc.matches("\\p{Punct}")) {
            return 3;
        }
        else {
            return 4;
        }
    }

    public static void main(String... args) {
        MyTokenizer mt = new MyTokenizer("asdf don't ]'.'..;'' as12....asdf.\nasdf");
        while(mt.hasNext()) {
            System.out.println(mt.next());
        }
    }
}
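
The same grouping by character type can also be sketched with a single regular expression that returns the whole list at once. This is an alternative under stated assumptions, not the tokenizer above: an apostrophe belongs to a word only when it sits between letters, and anything that is not an ASCII letter, digit or whitespace character is lumped into the last alternative. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTokenizerSketch {
    // Alternatives, in order: word (letters with inner apostrophes), digits,
    // whitespace, and a run of anything else (punctuation and other symbols).
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Alpha}+(?:'\\p{Alpha}+)*|\\d+|\\s+|[^\\p{Alpha}\\d\\s]+");

    // Splits the input into consecutive tokens; joining them restores the input.
    static List<String> tokenize(String in) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(in);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
        for (String token : tokenize(in)) {
            // Show each token in brackets so whitespace tokens stay visible.
            System.out.println("[" + token + "]");
        }
    }
}
```

Since the last alternative is the complement of the first three, every character ends up in exactly one token, so concatenating the list reproduces the original text, which is what a format-preserving spell checker needs.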

Split a string into a list of substrings of different character types
