Match words from a word list and count occurrences

Time: 2013-12-07 19:09:22

Tags: bash list sed awk grep

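I have a plain text file with some essentially random writing in it. I also have a word list that I want to compare it against, counting how many times each word from the list occurs in the text file.

For example, my word list could consist of: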

good
bad 
cupid
banana
apple

I then want to compare each of these words against my text file, which might look like this:

Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.

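I want my output to show how many times each listed word occurs. I can do this with an awk for-loop, but I would really like to avoid the for-loop because it takes forever: my real word list has about 10,000 words.

So in this case my output should be (I think) 9, since it counts the total number of occurrences of the words in that list.

By the way, that paragraph is completely random.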

4 Answers:

Answer 0: (score: 3)

For small to medium-sized texts, you can combine grep with wc:

cat <<EOF > word.list
good
bad 
cupid
banana
apple
EOF

cat <<EOF > input.txt
Sometimes I travel to the good places that are good, and never the bad places that are bad. For example I want to visit the heavens and meet a cupid eating an apple. Perhaps I will see mythological creatures eating other fruits like apples, bananas, and other good fruits.
EOF

while read -r search ; do
    # Quote the pattern so the shell does not split or glob it
    echo "$search: $(grep -o -- "$search" input.txt | wc -l)"
done < word.list | awk '{total += $2; print}END{printf "total: %s\n", total}'

Output:

good: 3
bad: 2
cupid: 1
banana: 1
apple: 2
total: 9
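
A side note on why the loop uses grep -o piped into wc -l rather than grep -c: grep -c counts matching lines, not matches, so a word that appears several times on one line would be counted once. A minimal sketch of the difference (the sample line and variable names are made up for illustration):

```shell
line='the good places that are good, and other good fruits'

# grep -c counts matching lines, so three hits on one line still give 1
line_count=$(printf '%s\n' "$line" | grep -c 'good')

# grep -o prints each match on its own line, so wc -l counts occurrences
# (tr strips the padding that some wc implementations add)
match_count=$(printf '%s\n' "$line" | grep -o 'good' | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"   # lines: 1, matches: 3
```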

Answer 1: (score: 2)

Awk solution:

awk -f cnt.awk words.txt input.txt

where cnt.awk is:

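# First file (words.txt): register each word with a count of 0
FNR==NR {
    word[$1]=0
    next
}
# Second file (input.txt): collect the whole text into one string
{
    str=str $0 RS
}
END{
    for (i in word) {
        stri=str
        # Find the next match, count it, then continue after it
        while(match(stri,i)) {
           stri=substr(stri,RSTART+RLENGTH)
           word[i]++
        }
    }
    for (i in word)
        print i, word[i]
}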

Answer 2: (score: 2)

If you do not need a detailed report, this is a faster version of @hek2mgl's answer:

while read word; do
    grep -o "$word" input.txt
done < words.txt | wc -l

If you do need a detailed report, use this version instead:

while read word; do
    grep -o "$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

Finally, if you want to match whole words only, you need a stricter pattern in grep:

while read word; do
    grep -o "\<$word\>" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

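However, with this pattern banana will not match bananas in the text. If you want banana to match bananas, you can anchor the pattern only at the beginning of the word, like this: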

while read word; do
    grep -o "\<$word" input.txt
done < words.txt | sort | uniq -c | awk '{ total += $1; print } END { print "total:", total }'

I am not sure whether it would be faster to grep for several words at once:

paste -d'|' - - - < words.txt | sed -e 's/ //g' -e 's/|*$//' | while read words; do
    grep -oE "\<($words)\>" input.txt
done

This greps for 3 words at a time. You can add more - arguments to paste to match more words at once, for example:

paste -d'|' - - - - - - - - - - < words.txt | ...

In any case, @HakonHægland, I would like to know which solution is fastest: this one or the awk solution.
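
Taking the "several words at once" idea to its limit, grep can read the whole word list as a pattern file, so the text is scanned once instead of once per word. Below is a minimal sketch, assuming a grep that supports -o, -F and -f (GNU and BSD grep both do). The function name count_words is made up here, and no timing was measured, so treat the speed benefit as an expectation rather than a result.

```shell
# Sketch only: count_words is a hypothetical helper, not part of any answer above.
count_words() {
    # $1 = word list (one word per line), $2 = text file
    cw_patterns=$(mktemp) || return 1

    # Remove stray whitespace (the sample list has "bad " with a trailing space)
    # and drop empty lines, since an empty pattern would match everything.
    sed -e 's/[[:space:]]//g' -e '/^$/d' "$1" > "$cw_patterns"

    # -F: treat patterns as fixed strings, -f: read them from a file,
    # -o: print every match on its own line so that uniq -c can tally them.
    grep -oFf "$cw_patterns" "$2" | sort | uniq -c |
        awk '{ total += $1; print $2 ": " $1 } END { print "total: " total + 0 }'

    rm -f "$cw_patterns"
}
```

Called as count_words words.txt input.txt, this matches substrings just like the first loop above, so apples still counts towards apple. Words that never occur are simply missing from the report instead of being listed with 0.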

Answer 3: (score: 2)

For any larger text, I would definitely use this:

perl -nE'BEGIN{open my$fh,"<",shift;my@a=map lc,map/(\w+)/g,<$fh>;@h{@a}=(0)x@a;close$fh}exists$h{$_}and$h{$_}++for map lc,/(\w+)/g}{for(keys%h){say"$_: $h{$_}";$s+=$h{$_}}say"Total: $s"' word.list input.txt
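
The one-liner is dense. As far as I can tell, it loads the word list into a hash, splits the text into lowercase tokens, and counts every token that is in the hash, which makes the cost depend on the length of the text rather than on text length times list length. A more readable sketch of that same idea in awk, wrapped in a shell function, might look like the following. The name count_whole_words is made up, and the sketch assumes words consist of plain ASCII letters, whereas Perl's \w also accepts digits and underscores.

```shell
# Sketch only: count_whole_words is a hypothetical helper.
count_whole_words() {
    # $1 = word list (one word per line), $2 = text file
    awk '
        # First file: store every list word as a key with a count of 0
        FNR == NR { if ($1 != "") want[tolower($1)] = 0; next }

        # Second file: turn each run of non-letters into a space,
        # split the line into tokens, and look each token up in the table
        {
            line = tolower($0)
            gsub(/[^a-z]+/, " ", line)
            n = split(line, tok, " ")
            for (k = 1; k <= n; k++)
                if (tok[k] in want) { want[tok[k]]++; total++ }
        }

        # Report every list word (order is unspecified) and the grand total
        END {
            for (w in want) print w ": " want[w]
            print "total: " total + 0
        }
    ' "$1" "$2"
}
```

Because this approach compares whole tokens, apples and bananas do not count towards apple and banana. On the sample text it reports a total of 7 rather than 9, with banana listed at 0.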