Question

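I'm using the American National Corpus (ANC) to get the frequencies of English words. The file is structured as follows (it is a large file, about 8 MB):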

Word1   Lemma1  Pos1    Frequency1
Word2   Lemma2  Pos2    Frequency2
Word3   Lemma3  Pos3    Frequency3

Here is my Java code, but it is very slow... How can I change it to speed it up? (I want to find the frequency associated with a specific word.)

    // Requires: import java.io.File; import java.util.Scanner;
    public static int frequency(String word) throws Exception {

        // try-with-resources closes the file even on an early return
        try (Scanner fscan = new Scanner(new File("./ANC-all-lemma.data"))) {
            fscan.useDelimiter("\n");

            while (fscan.hasNext()) {
                String frow = fscan.next();
                // Columns: word, lemma, POS, frequency (trim() also drops a trailing \r)
                String[] separated = frow.trim().split("\\s+");
                if (separated.length < 4) {
                    continue; // skip blank or malformed lines
                }

                String fwordC = separated[0]; // current word
                if (fwordC.equalsIgnoreCase(word)) {
                    System.out.println("Found!!!");
                    return Integer.parseInt(separated[3]);
                }
            }
        }
        return -1; // word not found
    }

Thanks a lot!

Answer 1

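You should definitely try reading the file with a BufferedReader. Scanner is meant for parsing data, not for reading large files line by line. BufferedReader also has a default buffer of about 8 KB.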
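
As a rough sketch of what that could look like (not a drop-in replacement): the class name `AncFrequency`, the `Path` parameter, the UTF-8 charset, and the `-1` "not found" return value are my own assumptions, as is the idea that the four columns are separated by whitespace. The pattern is compiled once so that it is not rebuilt for every line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

public class AncFrequency {

    // Compiled once; assumes columns are separated by tabs or spaces.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /**
     * Returns the frequency from the first line whose word column matches
     * the given word (case-insensitively), or -1 if no line matches.
     */
    public static int frequency(Path file, String word) throws IOException {
        // Assumption: the data file is UTF-8; change the charset if it is not.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed layout: word, lemma, POS, frequency.
                String[] columns = WHITESPACE.split(line.trim());
                if (columns.length >= 4 && columns[0].equalsIgnoreCase(word)) {
                    return Integer.parseInt(columns[3]);
                }
            }
        }
        return -1; // word not found
    }
}
```

You would call it with something like `AncFrequency.frequency(Paths.get("./ANC-all-lemma.data"), "house")`. `readLine()` already strips the line terminator, so the `replaceAll("(\\r|\\n)", "")` step from the question is no longer needed.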
