Question

I have a text file like this:

tom
and
jerry
went
to
america
and
england

I want to know the frequency of each word.

When I try the following command

sort test.txt|uniq -c

I get the following output

   1 america
   2 and
   1 england
   1 jerry
   1 to
   1 tom
   1 went

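But I also need partial matches. That is, the word "to" also occurs inside the word "tom", so the count I expect for "to" is 2. Is this possible with Unix commands?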

Answer 1

$ cat tst.awk
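# First pass over the file: register every word with a count of zero.
NR==FNR {
    cnt[$1] = 0
    next
}
# Second pass: gsub() returns the number of replacements it made, so
# replacing each match with itself ("&") counts the occurrences.
# Note that gsub() treats the word as a regular expression.
{
    for (word in cnt) {
        cnt[word] += gsub(word,"&")
    }
}
# Output order of "for (word in cnt)" is unspecified.
END {
    for (word in cnt) {
        print word, cnt[word]
    }
}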

$ awk -f tst.awk file file
went 1
america 1
to 2
and 3
england 1
jerry 1
tom 1
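
One caveat with the script above: gsub() interprets each word as a regular expression, so a word containing characters such as . or * would match more than its literal text. The following is a minimal sketch, not a drop-in replacement, that counts literal, non-overlapping occurrences with index() instead. The function name count_literal is made up for illustration, and it assumes one word per line as in the question.

```shell
# count_literal FILE
# For each unique word in FILE (one word per line), print the word and the
# number of literal, non-overlapping occurrences across all lines of FILE.
count_literal() {
    awk '
    # Pass 1: register every non-empty word with a count of zero.
    NR == FNR { if (NF) cnt[$1] = 0; next }
    # Pass 2: scan each line for every registered word.
    {
        for (word in cnt) {
            rest = $0
            # index() is a plain substring search, no regex involved.
            while ((pos = index(rest, word)) > 0) {
                cnt[word]++
                rest = substr(rest, pos + length(word))
            }
        }
    }
    END { for (word in cnt) print word, cnt[word] }
    ' "$1" "$1"
}
```

Because awk does not guarantee the iteration order of "for (word in cnt)", pipe the result through sort if a stable order is needed, e.g. count_literal file | sort.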

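Since you mentioned in a comment that you are short on RAM: if you don't have enough RAM to hold all of the file's unique words in memory at once, run the above in a loop over N (10? 100? 1000?) thousand words at a time, e.g. (bash-like pseudocode):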

sort -u file > tmp
total=$(wc -l < tmp)
for (( i=1; i<=total; i+=10000 )); do
    # feed unique words i through i+9999 to awk as its first input
    awk -f tst.awk <(sed -n "${i},$((i + 9999))p" tmp) file
done
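
The same chunking idea can also be sketched with split, which cuts the list of unique words into fixed-size pieces on disk. This is only an illustration under assumptions: the function name count_in_chunks and the chunk.* file names are invented here, and the awk program is inlined (same logic as tst.awk) so the block is self-contained.

```shell
# count_in_chunks FILE CHUNK_LINES
# Count partial matches while holding at most CHUNK_LINES unique words
# in awk memory at any one time.
count_in_chunks() {
    file=$1
    chunk_lines=$2
    workdir=$(mktemp -d) || return 1
    sort -u "$file" > "$workdir/words"
    # split writes pieces named chunk.aa, chunk.ab, ... of CHUNK_LINES lines.
    split -l "$chunk_lines" "$workdir/words" "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do
        [ -f "$chunk" ] || continue   # no pieces if FILE was empty
        awk '
        NR == FNR { cnt[$1] = 0; next }
        { for (word in cnt) cnt[word] += gsub(word, "&") }
        END { for (word in cnt) print word, cnt[word] }
        ' "$chunk" "$file"
    done
    rm -rf "$workdir"
}
```

Each unique word lands in exactly one piece, so the concatenated output lists every word once. The trade-off is one full read of the input file per piece.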

Answer 2

You can call grep for each unique word in the file:

while IFS= read -r pattern; do
    count="$(grep -o "$pattern" test.txt | wc -l)" # can't use grep -c as it counts lines
    printf '%s: %d\n' "$pattern" "$count"
done < <(sort test.txt | uniq)
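
The remark about grep -c is worth a small illustration: -c counts matching lines, whereas -o prints every match on its own line, so wc -l then counts occurrences. A minimal sketch follows; the helper names count_lines and count_matches are hypothetical, and -F (fixed-string matching) is added here as an assumption so the word is not interpreted as a regular expression.

```shell
# count_lines PATTERN FILE: number of lines that contain PATTERN.
count_lines() {
    grep -c -F -- "$1" "$2"
}

# count_matches PATTERN FILE: number of non-overlapping occurrences.
# tr strips the padding that some wc implementations add.
count_matches() {
    grep -o -F -- "$1" "$2" | wc -l | tr -d ' '
}
```

On a file whose only matching line is "to tom to", count_lines reports one line for "to" while count_matches reports three occurrences.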

Answer 3

Script:

#!/bin/bash

while IFS= read -r word; do
    count=$(grep -o "${word}" file | wc -l)
    echo "${word} : ${count}"
done < file

Output:

tom : 1
and : 3
jerry : 1
went : 1
to : 2
america : 1
and : 3
england : 1

Answer 4

Perl was made for things like this, if anything was:

$ perl -e '@lines=<>;for $x(@lines){chomp $x;print 0+grep(/$x/,@lines), " $x\n"}' text_file
1 tom
3 and
1 jerry
1 went
2 to
1 america
3 and
1 england

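<> in list context reads all lines into an array at once.

chomp removes the trailing newline.

0+ puts grep in scalar context, where it evaluates to just the count of matching elements (here, lines that contain the word).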

Get word frequency from a file using partial matches
