Question

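I am trying to tokenize an input file from sentences into tokens (words). For example, "This is a test file." should be split into five words, "this" "is" "a" "test" "file", omitting punctuation and whitespace, and the words should be stored in an ArrayList. I tried writing code like this: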

public static ArrayList<String> tokenizeFile(File in) throws IOException {
    String strLine;
    String[] tokens;
    //create a new ArrayList to store tokens
    ArrayList<String> tokenList = new ArrayList<String>();

    if (null == in) {
        return tokenList;
    } else {
        FileInputStream fStream = new FileInputStream(in);
        DataInputStream dataIn = new DataInputStream(fStream);
        BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));

        while (null != (strLine = br.readLine())) {
            if (strLine.trim().length() != 0) {

                //make sure strings are independent of capitalization and then tokenize them
                strLine = strLine.toLowerCase();

                //create regular expression pattern to split
                //first letter to be alphabetic and the remaining characters to be alphanumeric or '
                String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
                tokens = strLine.split(pattern);
                int tokenLen = tokens.length;

                for (int i = 1; i <= tokenLen; i++) {
                    tokenList.add(tokens[i - 1]);
                }
            }
        }
        br.close();
        dataIn.close();
    }
    return tokenList;
}

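This code runs without errors, but I found that instead of splitting the whole file into words (tokens), it stores each entire line as a single token. "area area" becomes one token, rather than "area" appearing twice. I don't see the error in the code. I suspect something may be wrong with my trim(). Any helpful suggestions are appreciated. Thank you very much.

Maybe I should use a Scanner? I'm confused.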

Answer 1

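I think Scanner is a better fit for this task. For this code, you should fix the regular expression: split() takes a pattern that describes the separators, not the tokens. Try "\\s+".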
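
To make the suggestion concrete, here is a minimal sketch of both approaches. The pattern in the question is anchored with ^ and $ and describes a whole word, so split() never finds a separator inside a line such as "area area" and returns the line unchanged. The class name TokenizeSketch, the method names, and the separator pattern "[^a-z0-9'-]+" are assumptions made for illustration. That pattern is used in place of "\\s+" so that punctuation is dropped as well, and unlike the question's pattern it does not require a token to start with a letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenizeSketch {

    // Split one line into lowercase word tokens.
    // The regex describes the separators (any run of characters that are
    // not a letter, digit, apostrophe, or hyphen), not the words themselves.
    static List<String> tokenizeLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.toLowerCase().split("[^a-z0-9'-]+")) {
            if (!token.isEmpty()) { // a leading separator yields an empty first element
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Scanner variant: the same separator regex is used as the delimiter.
    static List<String> tokenizeWithScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text.toLowerCase())) {
            scanner.useDelimiter("[^a-z0-9'-]+");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeLine("This is a test file."));  // [this, is, a, test, file]
        System.out.println(tokenizeWithScanner("area area"));      // [area, area]
    }
}
```

In tokenizeFile, the same idea would amount to replacing the pattern passed to strLine.split(...) and skipping empty tokens before adding them to tokenList.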

Answer 2

In the same code, try setting the pattern to: [pattern missing from the source]

How to split a file into multiple tokens
