Split a string into a list of substrings of different character types

Time: 2015-11-02 15:43:25

Tags: java regex string

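I am writing a spell checker that takes a text file as input and outputs the file with the spelling corrected.

The program should preserve formatting and punctuation.

I want to split the input text into a list of string tokens so that each token is a run of one or more characters of a single kind: word, punctuation, whitespace, or digit characters.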

For example:

Input (words.txt):

asdf don't ]'.'..;'' as12....asdf.
asdf

Input as a list:

["asdf", " ", "don't", " ", "]'.'..;''", " ", "as", "12", "....", "asdf", ".", "\n", "asdf"]

Words like won't and i'll should be treated as a single token.

Having the data in this format would let me process the tokens like this:

String output = "";

for(String token : tokens) {
    if(isWord(token)) {
        if(!inDictionary(token)) {
            token = correctSpelling(token);
        }
    }
    output += token;
}

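So my main question is: how can I split a string of text into a list of substrings as described above? Thanks.

2 Answers:

Answer 0: (score: 0)

The main difficulty here is finding a regular expression that matches what you consider a "word". In my example, I treat ' as part of a word if the word starts with a letter, or if the characters following the ' are letters: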

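//Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.
//inDictionary() and correctSpelling() are the helpers from the question.
public static void main(String[] args) {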
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";

        //The pattern: 
        Pattern p = Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");

        Matcher m = p.matcher(in);
        //If you want to collect the words
        List<String> words = new ArrayList<String>();

        StringBuilder result = new StringBuilder();

        //Now find something from the start
        int pos = 0; 
        while(m.find(pos)) {
            //Add everything from starting position to beginning of word
            result.append(in.substring(pos, m.start()));

            //Handle dictionary logic
            String token = m.group();
            words.add(token); //not used actually
            if(!inDictionary(token)) {
                token = correctSpelling(token);
            }
            //Add to result
            result.append(token);
            //Repeat from end position
            pos = m.end();
        }
        //Append remainder of input
        result.append(in.substring(pos));

        System.out.println("Result: " + result.toString());
    }
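
The code above only extracts the words and copies everything else through unchanged. If you do want the full token list from the question, one possible approach is a single pattern with one alternative per character type. The following is only a sketch under some assumptions: the class name RegexTokenizer and the method tokenize are made up for illustration, and the word rule is narrower than the pattern above (an apostrophe counts as part of a word only when it sits between two letters). Note that \p{Alpha} matches ASCII letters only by default, so accented letters would end up in the "everything else" group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class RegexTokenizer {
    // One alternative per token type. Together they cover every character,
    // so concatenating the tokens gives back the original text.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Alpha}+(?:'\\p{Alpha}+)*"   // word; apostrophes only between letters
            + "|\\d+"                        // run of digits
            + "|\\s+"                        // run of whitespace
            + "|[^\\p{Alpha}\\d\\s]+");      // run of anything else (punctuation)

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
        for (String token : tokenize(in)) {
            // Show the newline token visibly instead of breaking the line
            System.out.println("[" + token.replace("\n", "\\n") + "]");
        }
    }
}
```

With the sample input this should give the 13 tokens listed in the question, and the loop from the question can then run over the returned list.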

Answer 1: (score: 0)

Because I like solving puzzles, I tried the following, and I think it works well:

public class MyTokenizer {
    private final String str;
    private int pos = 0;

    public MyTokenizer(String str) {
        this.str = str;
    }

    public boolean hasNext() {
        return pos < str.length();
    }

    public String next() {
        int type = getType(str.charAt(pos));
        StringBuilder sb = new StringBuilder();
        while(hasNext() && (str.charAt(pos) == '\'' || type == getType(str.charAt(pos)))) {
            sb.append(str.charAt(pos));
            pos++;
        }
        return sb.toString();
    }

    private int getType(char c) {
        String sc = Character.toString(c);
        if (sc.matches("\\d")) {
            return 0;
        }
        else if (sc.matches("\\w")) {
            return 1;
        }
        else if (sc.matches("\\s")) {
            return 2;
        }
        else if (sc.matches("\\p{Punct}")) {
            return 3;
        }
        else {
            return 4;
        }
    }

    public static void main(String... args) {
        MyTokenizer mt = new MyTokenizer("asdf don't ]'.'..;'' as12....asdf.\nasdf");
        while(mt.hasNext()) {
            System.out.println(mt.next());
        }
    }
}
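
A possible variation on the tokenizer above, offered only as a sketch: classify characters with the java.lang.Character methods instead of running a regex on every character, and let an apostrophe join a word only when it has a letter on both sides (the loop above attaches an apostrophe to whatever token is currently being built, including whitespace). The names CharTypeTokenizer, CharType, classify, typeAt and tokenize are hypothetical. Be aware that Character.isLetter and Character.isDigit are Unicode-aware, and that an underscore is treated as "other" here, so the grouping is not identical to \w and \d.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variation, not the original answer's class.
class CharTypeTokenizer {
    enum CharType { DIGIT, LETTER, WHITESPACE, OTHER }

    // Classify a single character without regular expressions.
    static CharType classify(char c) {
        if (Character.isDigit(c)) return CharType.DIGIT;
        if (Character.isLetter(c)) return CharType.LETTER;
        if (Character.isWhitespace(c)) return CharType.WHITESPACE;
        return CharType.OTHER;
    }

    // Type of the character at index i; an apostrophe with a letter on
    // both sides is treated as a letter so that "don't" stays one token.
    static CharType typeAt(String s, int i) {
        char c = s.charAt(i);
        if (c == '\'' && i > 0 && i + 1 < s.length()
                && Character.isLetter(s.charAt(i - 1))
                && Character.isLetter(s.charAt(i + 1))) {
            return CharType.LETTER;
        }
        return classify(c);
    }

    // Split s into maximal runs of characters that share one type.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= s.length(); i++) {
            if (i == s.length() || typeAt(s, i) != typeAt(s, start)) {
                tokens.add(s.substring(start, i));
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("asdf don't ]'.'..;'' as12....asdf.\nasdf")) {
            System.out.println("[" + token.replace("\n", "\\n") + "]");
        }
    }
}
```

Returning a List rather than exposing hasNext()/next() is just a convenience for the for-each loop in the question; the iterator style above works equally well.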