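I'm writing a spell-checking program that takes a text file as input and outputs the file with the spelling corrected.
The program should preserve formatting and punctuation.
I want to split the input text into a list of string tokens such that each token is one or more characters of a single kind: word, punctuation, whitespace, or digit characters.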
For example:
Input (words.txt):
asdf don't ]'.'..;'' as12....asdf.
asdf
Input as a list:
["asdf", " ", "don't", " ", "]'.'..;''", " ", "as", "12", "....", "asdf", ".", "\n", "asdf"]
Words such as won't and i'll should be treated as a single token.
Having the data in this format would let me process the tokens like this:
String output = "";
for (String token : tokens) {
    if (isWord(token)) {
        if (!inDictionary(token)) {
            token = correctSpelling(token);
        }
    }
    output += token;
}
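So my main question is: how can I split a string of text into a list of substrings as described above? Thanks.
Answer 0 (score: 0)
The main difficulty here is finding a regular expression that matches what you consider a "word". For my example, I treat ' as part of a word if the word starts with a letter, or if the characters following the ' are letters: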
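import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; the original answer showed only main()
public class WordCorrector {
    public static void main(String[] args) {
        String in = "asdf don't ]'.'..;'' as12....asdf.\nasdf";
        // The pattern: a letter followed by letters or apostrophes,
        // or an apostrophe followed by one or more letters
        Pattern p = Pattern.compile("[\\p{Alpha}][\\p{Alpha}']*|'[\\p{Alpha}]+");
        Matcher m = p.matcher(in);
        // If you want to collect the words
        List<String> words = new ArrayList<String>();
        StringBuilder result = new StringBuilder();
        // Now find something from the start
        int pos = 0;
        while (m.find(pos)) {
            // Add everything from starting position to beginning of word
            result.append(in.substring(pos, m.start()));
            // Handle dictionary logic
            String token = m.group();
            words.add(token); // not actually used
            if (!inDictionary(token)) {
                token = correctSpelling(token);
            }
            // Add to result
            result.append(token);
            // Repeat from end position
            pos = m.end();
        }
        // Append remainder of input
        result.append(in.substring(pos));
        System.out.println("Result: " + result.toString());
    }

    // Placeholder for the asker's dictionary lookup (not part of the original answer)
    static boolean inDictionary(String word) {
        return true;
    }

    // Placeholder for the asker's correction routine (not part of the original answer)
    static String correctSpelling(String word) {
        return word;
    }
}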
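The code above copies the non-word text through untouched, so it never builds the token list the question asks for. If you do want that list, one possible variation is a single regex alternation that covers every kind of character. This is only a sketch: the class name RegexTokenizer and the extra alternatives for digits, whitespace, and "everything else" are my own assumptions, and only the word part is the pattern from this answer. Note that \p{Alpha}, \d and \s are ASCII-only by default in Java, so accented letters would end up in the "everything else" bucket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class RegexTokenizer {
    // Alternatives are tried left to right at each position.
    private static final Pattern TOKEN = Pattern.compile(
            "\\p{Alpha}[\\p{Alpha}']*"      // word: starts with a letter, may contain '
            + "|'\\p{Alpha}+"               // apostrophe followed by letters
            + "|\\d+"                       // run of digits
            + "|\\s+"                       // run of whitespace
            + "|[^\\p{Alpha}\\d\\s]+");     // run of anything else (punctuation, symbols)

    // Splits text into tokens; concatenating the tokens gives back the input,
    // because every character falls into one of the alternatives above.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Brackets make whitespace tokens visible in the output
        for (String t : tokenize("asdf don't ]'.'..;'' as12....asdf.\nasdf")) {
            System.out.println("[" + t + "]");
        }
    }
}
```

For the sample input this should give the 13 tokens listed in the question, and the loop from the question can then run over the returned list unchanged.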
Answer 1 (score: 0)
Because I like solving puzzles, I tried the following, which I think works well:
public class MyTokenizer {
    private final String str;
    private int pos = 0;

    public MyTokenizer(String str) {
        this.str = str;
    }

    public boolean hasNext() {
        return pos < str.length();
    }

    // Returns the next run of characters that share the type of the first one
    public String next() {
        int type = getType(str.charAt(pos));
        StringBuilder sb = new StringBuilder();
        // An apostrophe extends a token only inside a word (type 1), so "don't"
        // stays whole but whitespace or digits never absorb a following '
        while (hasNext() && (type == getType(str.charAt(pos))
                || (type == 1 && str.charAt(pos) == '\''))) {
            sb.append(str.charAt(pos));
            pos++;
        }
        return sb.toString();
    }

    // 0 = digit, 1 = word character, 2 = whitespace, 3 = punctuation, 4 = anything else
    private int getType(char c) {
        String sc = Character.toString(c);
        if (sc.matches("\\d")) {
            return 0;
        }
        else if (sc.matches("\\w")) {
            return 1;
        }
        else if (sc.matches("\\s")) {
            return 2;
        }
        else if (sc.matches("\\p{Punct}")) {
            return 3;
        }
        else {
            return 4;
        }
    }

    public static void main(String... args) {
        MyTokenizer mt = new MyTokenizer("asdf don't ]'.'..;'' as12....asdf.\nasdf");
        while (mt.hasNext()) {
            System.out.println(mt.next());
        }
    }
}
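To show how a tokenizer like this could plug into the loop from the question, here is a rough, self-contained sketch. Everything in it is an assumption made for illustration: the class name SpellCheckSketch, the toy DICTIONARY and CORRECTIONS data, and the use of Character.isDigit / isLetter / isWhitespace in place of the per-character regexes. It needs Java 9 or later because of Set.of and Map.of, and it builds the output with a StringBuilder rather than repeated string concatenation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the dictionary and corrections are toy placeholders.
final class SpellCheckSketch {
    private static final Set<String> DICTIONARY = Set.of("hello", "world", "don't");
    private static final Map<String, String> CORRECTIONS =
            Map.of("helo", "hello", "wrld", "world");

    // Character classes: 0 = digit, 1 = letter, 2 = whitespace, 3 = anything else
    static int typeOf(char c) {
        if (Character.isDigit(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isWhitespace(c)) return 2;
        return 3;
    }

    // Splits text into runs of one character class; an apostrophe is allowed
    // to continue a run only when that run is a word.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int start = pos;
            int type = typeOf(text.charAt(pos));
            pos++;
            while (pos < text.length()) {
                char c = text.charAt(pos);
                boolean sameType = typeOf(c) == type;
                boolean apostropheInWord = type == 1 && c == '\'';
                if (!sameType && !apostropheInWord) break;
                pos++;
            }
            tokens.add(text.substring(start, pos));
        }
        return tokens;
    }

    // A token is a word if it starts with a letter (tokens are never empty)
    static boolean isWord(String token) {
        return Character.isLetter(token.charAt(0));
    }

    static boolean inDictionary(String word) {
        return DICTIONARY.contains(word.toLowerCase());
    }

    // Returns the known correction, or the word unchanged if none is known
    static String correctSpelling(String word) {
        return CORRECTIONS.getOrDefault(word.toLowerCase(), word);
    }

    // Same loop as in the question: only word tokens are touched,
    // everything else is appended verbatim.
    static String check(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (String token : tokenize(text)) {
            if (isWord(token) && !inDictionary(token)) {
                token = correctSpelling(token);
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(check("helo, wrld!\n  don't  panic 42"));
    }
}
```

With the toy data above, "helo" and "wrld" are replaced while the punctuation, the newline, the double spaces, and the digits pass through unchanged. A real correctSpelling would need an actual suggestion strategy, which is outside the scope of the tokenizing question.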