Counting the total occurrences of a set of words with grep in bash

Time: 2017-09-19 10:05:55

Tags: bash grep

I have a set of words: happy enjoy dead cheerful

I want to count the total number of occurrences of these words in a text file, q.txt.

Right now I am using grep to count each word individually and then adding the counts together, but that does not scale well as more words are added.

4 answers:

Answer 0 (score: 3)

words="happy enjoy dead cheerful"
regex=$(set -- $words; IFS='|'; echo "$*")
grep -o -E -w "$regex" q.txt | sort | uniq -c

To get the grand total:

while read -r count word; do
    (( t += count ))
    printf "%8d %s\n" "$count" "$word"
done < <(grep -o -E -w "$regex" q.txt | sort | uniq -c)
echo "total is $t"
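If only the grand total is needed, the per-word `sort | uniq -c` step can be skipped: since `grep -o` prints one match per line, `wc -l` gives the sum directly. A minimal sketch (the sample q.txt below is a small stand-in for the question's file):

```shell
# Stand-in for the question's q.txt.
printf 'happy dead\nenjoy cheerful happy\n' > q.txt

# Build the alternation regex from the word list, as in the answer above.
words="happy enjoy dead cheerful"
regex=$(set -- $words; IFS='|'; echo "$*")

# -o emits each match on its own line, so wc -l is the total count.
grep -o -E -w "$regex" q.txt | wc -l   # prints 5 for this sample
```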

Answer 1 (score: 0)

What do you mean by "total"? Occurrences? Do you want to output the count of each word separately, or the sum over all the words?

I would do something like this:

Put the words you want to count in a separate file, words.txt, one per line. Then, if you want to output the count of each word:

for i in $(cat words.txt); do
    echo -n "$i - "
    grep -c "$i" q.txt
done

If you just want the sum of all the counts, something like this:

for i in $(cat words.txt); do
    grep -c "$i" q.txt
done | awk '{SUM += $1} END {print SUM}'
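With the words already in a file, the shell loop can be dropped entirely: grep's `-f` option reads one pattern per line from a file, and combined with `-F` (literal strings), `-w` (whole words), and `-o` (one match per output line), `wc -l` yields the sum in a single pass. A sketch with stand-in files:

```shell
# Stand-ins for the answer's words.txt and the question's q.txt.
printf 'happy\nenjoy\ndead\ncheerful\n' > words.txt
printf 'be happy\nenjoy and be dead\nhappy again\n' > q.txt

# -f words.txt: patterns from file; -F: literal match; -w: whole words;
# -o: print each match on its own line, so wc -l is the total.
grep -oFwf words.txt q.txt | wc -l   # prints 4 for this sample
```

Note that this counts every occurrence, whereas `grep -c` above counts matching lines, so the two can differ when a word appears twice on one line.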

Answer 2 (score: 0)

Using a single awk process.
Also, I believe this will be significantly faster on "big" files than the grep + sort + uniq pipeline:

Sample q.txt:

I thought that the aim of life is to be happy. Till you not dead -  you enjoy of life and feeling cheerful.
Just enjoy and then dead ...
Everyone want to be happy. Am I happy?
Just remember that we'll all die. Live like dead man, striving to recreate hisself ... and not just dreaming about cheerful, 
enjoy, happy ...

awk -v RS='[,."?!]*[[:space:]]+' '/happy|enjoy|dead|cheerful/{ a[$0]++ }
           END{ for(i in a) print i,a[i] }' q.txt

Output:

cheerful 2
enjoy 3
happy 4
dead 3

Answer 3 (score: 0)

Timing some of the answers.

I concatenated /usr/share/dict/words several times to create a big file:

$ ll words
-rw-rw-r-- 1 jackman jackman 653M Sep 19 11:10 words

grep | sort | uniq

$ time sh -c 'grep -oEw "happy|enjoy|dead|cheerful" words | sort | uniq -c'
    729 cheerful
   1458 dead
    729 enjoy
    729 happy

real    0m2.232s
user    0m2.148s
sys 0m0.084s

AWK

$ time awk -v RS='[,."?!]*[[:space:]]+' '/happy|enjoy|dead|cheerful/{ a[$0]++ } END{ for(i in a) print i,a[i] }' words
deaden 729
deadliness 729
deader 729
deadline 729
deadbeats 729
deadens 729
cheerfuller 729
deadened 729
deadliest 729
enjoyable 729
deadlock's 729
dead's 729
deadbolts 729
cheerfulness 729
deadlier 729
deadbolt's 729
deadbeat's 729
happy 729
deadwood 729
cheerfully 729
enjoyment's 729
deadpan's 729
deadbeat 729
deadbolt 729
deadliness's 729
cheerfullest 729
enjoyments 729
deadlock 729
enjoyment 729
deadpan 729
deadpanned 729
dead 729
enjoy 729
deadest 729
deadpanning 729
deadly 729
enjoys 729
slaphappy 729
unhappy 729
deadlocks 729
deadlines 729
deadpans 729
deadening 729
enjoyed 729
deadlocked 729
deadwood's 729
cheerfulness's 729
deadline's 729
enjoying 729
deadlocking 729
cheerful 729

real    0m46.817s
user    0m46.720s
sys 0m0.228s

awk again, but simplified: since we know the file is structured as one word per line, we can use hash lookups and avoid regex matching entirely.

$ time awk -v w="happy enjoy dead cheerful" '
    BEGIN {n=split(w,a); for (i=1; i<=n; i++) words[a[i]]=1} 
    $1 in words {count[$1]++} 
    END {for (word in count) print count[word], word}
' words
729 cheerful
729 enjoy
729 happy
729 dead

real    0m13.781s
user    0m13.652s
sys 0m0.164s
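The same hash-lookup approach can also emit the grand total the original question asked for, at no extra cost, by bumping a counter alongside the per-word counts. A sketch against a small one-word-per-line sample file (a stand-in for the big concatenated dictionary):

```shell
# One-word-per-line sample, standing in for the 653M "words" file.
printf 'happy\ndead\nenjoy\nhappy\nzebra\n' > words_sample.txt

awk -v w="happy enjoy dead cheerful" '
    BEGIN {n = split(w, a); for (i = 1; i <= n; i++) words[a[i]] = 1}
    $1 in words {count[$1]++; total++}   # hash lookup instead of a regex
    END {
        for (word in count) print count[word], word
        print "total", total
    }
' words_sample.txt
```

For this sample the final line is `total 4` (happy twice, dead and enjoy once each).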

Would direct string-equality comparisons be faster, given that the list of "needle" words is short?

$ time awk '                                 
    $1 == "happy" || $1 == "enjoy" || $1 == "dead" || $1 == "cheerful" {count[$1]++} 
    END {for (word in count) print count[word], word}
' words
729 cheerful
729 enjoy
729 happy
729 dead

real    0m32.738s
user    0m32.668s
sys 0m0.156s

No. It seems the `in` operator is fast.

Surprisingly (to me), grepping the file multiple times is still very fast:

$ time sh -c 'for i in happy enjoy dead cheerful; do echo "$(grep -cFx "$i" words) $i"; done'
729 happy
729 enjoy
729 dead
729 cheerful

real    0m2.480s
user    0m2.132s
sys 0m0.348s

Either way, the grep | sort | uniq pipeline is the fastest so far.

New winner: grepping the file multiple times, but with different options:

$ time sh -c 'for i in happy enjoy dead cheerful; do echo "$(grep -cw "$i" words) $i"; done'
729 happy
729 enjoy
1458 dead
729 cheerful

real    0m1.708s
user    0m1.348s
sys 0m0.356s
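To recover the single grand total the question originally asked for from this winning variant, the per-word counts can simply be summed in the shell loop. A sketch (the `words` file here is a tiny stand-in for the 653M test file):

```shell
# Tiny stand-in for the big "words" test file.
printf 'happy\ndead\nenjoy\nhappy\n' > words

total=0
for i in happy enjoy dead cheerful; do
    n=$(grep -cw "$i" words)   # -c: count matching lines; -w: whole words
    total=$((total + n))
done
echo "$total"   # prints 4 for this sample
```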