Question

I need to write code that counts the individual words in a .txt file. The output format must look like this:

the - 10
text - 1
has - 5
etc.

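I've run into a problem I can't seem to solve: the text uses apostrophes as quotation marks, so my code has to parse words like "don't", yet it doesn't see a quoted word such as 'the as the same word as the. I don't know how to fix this.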

Here is the relevant part of the code. I have to use a regular expression in the delimiter.

static int findAndCountWords (Scanner scanner, String[] words, int [] freqs)
{
    assert (words != null)&&(freqs != null): "findAndCountWords doesn't work.";
    int nr=0;
    while (scanner.hasNext())
    {   
        String word = scanner.next();
        word = word.toLowerCase();
        scanner.useDelimiter("[^a-z]");
        //|[^a-z]+[\\'][^a-z]+
        if (updateWord(word, words, freqs, nr))
            nr++;
    }
    return nr;
}

Answer 1

I would first strip any leading or trailing apostrophes from your words.

You can do this with Apache Commons:

str = StringUtils.strip(str, "'"); // strips apostrophes from both ends

Or with a regex matcher:

Pattern pattern = Pattern.compile("(?:^')|(?:'$)"); // starts or ends with apostrophe
str = pattern.matcher(str).replaceAll(""); // not anymore

(I haven't tested this code, so it may contain bugs.)
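To show how the stripping step might fit together with the Scanner delimiter, here is a minimal sketch. It rests on a few assumptions that are not in the question: the class name WordCount, the method countWords, and the constant EDGE_APOSTROPHES are made up for illustration, and a Map stands in for the asker's parallel arrays and the updateWord helper, whose code is not shown. The idea is to keep apostrophes out of the delimiter so that "don't" survives as one token, and then to remove any apostrophes left at the edges of each token.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Pattern;

class WordCount {
    // Hypothetical helper pattern: one or more apostrophes at the very
    // start or the very end of a token.
    private static final Pattern EDGE_APOSTROPHES = Pattern.compile("^'+|'+$");

    // Assumption: a Map replaces the words/freqs arrays from the question.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Lowercase the whole input first so [^a-z'] does not split on capitals.
        try (Scanner scanner = new Scanner(text.toLowerCase())) {
            // Set the delimiter before reading the first token: any run of
            // characters that is neither a lowercase letter nor an apostrophe.
            scanner.useDelimiter("[^a-z']+");
            while (scanner.hasNext()) {
                // Drop quoting apostrophes at the edges, keep inner ones (don't).
                String word = EDGE_APOSTROPHES.matcher(scanner.next()).replaceAll("");
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countWords("'The' cat said: don't chase the dog.")
            .forEach((word, count) -> System.out.println(word + " - " + count));
    }
}
```

With this input, 'The' and the are counted as the same word, while don't stays a single word. One limitation of this sketch: a word that legitimately ends in an apostrophe, such as the possessive dogs', loses that apostrophe too.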

Quote issue when counting individual words in a large text file
