I have a set of words:
happy enjoy dead cheerful
and I want to count how many times they occur in a text file, q.txt.
Right now I am using grep to count each word individually and then adding the counts together, but this does not scale well as I add more words.
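Presumably the current approach looks something like the sketch below: one grep per word, with the counts summed by hand (the sample q.txt contents are invented for illustration).

```shell
# Hypothetical reconstruction of the one-grep-per-word approach.
printf 'happy enjoy dead cheerful\nhappy dead\n' > q.txt
happy=$(grep -ow happy q.txt | wc -l)       # -o: one line per match; -w: whole words
enjoy=$(grep -ow enjoy q.txt | wc -l)
dead=$(grep -ow dead q.txt | wc -l)
cheerful=$(grep -ow cheerful q.txt | wc -l)
echo $((happy + enjoy + dead + cheerful))   # prints 6 for this sample file
```

Every new word means another grep invocation and another variable, which is what the answers below try to avoid.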
Answer 0 (score: 3)
words="happy enjoy dead cheerful"
regex=$(set -- $words; IFS='|'; echo "$*")
grep -o -E -w "$regex" q.txt | sort | uniq -c
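For reference, the regex-building line runs inside a command substitution, so the `set --` and the IFS change happen in a subshell and leave the calling shell untouched; the words simply come out joined with `|`:

```shell
words="happy enjoy dead cheerful"
# "$*" joins the positional parameters with the first character of IFS.
regex=$(set -- $words; IFS='|'; echo "$*")
echo "$regex"   # prints: happy|enjoy|dead|cheerful
```

Adding a word to $words is now the only change needed; the regex follows automatically.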
To get a total:
while read -r count word; do
(( t += count ))
printf "%8d %s\n" "$count" "$word"
done < <(grep -o -E -w "$regex" q.txt | sort | uniq -c)
echo total is $t
Answer 1 (score: 0)
I would do something like this:
Put the words you want to count in a separate file, words.txt, one per line. Then, if you want a count for each word:
for i in `cat words.txt`; do
echo -n "$i - "
grep -c $i q.txt
done
If you just want the sum of all the counts, perhaps something like this:
for i in `cat words.txt`; do
grep -c $i q.txt
done| awk '{SUM += $1} END {print SUM}'
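A self-contained run of the second loop, with small sample files (contents assumed for illustration). One caveat of this answer: grep -c counts matching lines, not occurrences, so a word appearing twice on one line is only counted once.

```shell
# Sample inputs (assumed).
printf 'happy\nenjoy\ndead\ncheerful\n' > words.txt
printf 'happy enjoy\ndead happy\ncheerful\n' > q.txt

# Sum of per-word line counts: happy=2, enjoy=1, dead=1, cheerful=1.
for i in `cat words.txt`; do
    grep -c "$i" q.txt
done | awk '{SUM += $1} END {print SUM}'   # prints 5
```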
Answer 2 (score: 0)
Use a single awk process.
Also, I believe this will be significantly faster on "big" files than grep + sort + uniq.
Sample q.txt:
I thought that the aim of life is to be happy. Till you not dead - you enjoy of life and feeling cheerful.
Just enjoy and then dead ...
Everyone want to be happy. Am I happy?
Just remember that we'll all die. Live like dead man, striving to recreate hisself ... and not just dreaming about cheerful,
enjoy, happy ...
awk -v RS='[,."?!]*[[:space:]]+' '/happy|enjoy|dead|cheerful/{ a[$0]++ }
END{ for(i in a) print i,a[i] }' q.txt
Output:
cheerful 2
enjoy 3
happy 4
dead 3
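The record-separator trick above relies on awk implementations where RS may be an extended regex (GNU awk, and recent mawk): each record then becomes a single word with trailing punctuation consumed by the separator. A minimal illustration, assuming GNU awk:

```shell
# Punctuation plus whitespace acts as the record separator,
# so "happy, dead! enjoy." splits into three clean words.
printf 'happy, dead! enjoy.\n' |
    awk -v RS='[,."?!]*[[:space:]]+' '{print NR ": " $0}'
```

With a strictly POSIX awk, RS is a single character and this pattern will not work.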
Answer 3 (score: 0)
Timing some of the answers.
I concatenated /usr/share/dict/words several times to create a big file:
$ ll words
-rw-rw-r-- 1 jackman jackman 653M Sep 19 11:10 words
grep | sort | uniq
$ time sh -c 'grep -oEw "happy|enjoy|dead|cheerful" words | sort | uniq -c'
729 cheerful
1458 dead
729 enjoy
729 happy
real 0m2.232s
user 0m2.148s
sys 0m0.084s
AWK
$ time awk -v RS='[,."?!]*[[:space:]]+' '/happy|enjoy|dead|cheerful/{ a[$0]++ } END{ for(i in a) print i,a[i] }' words
deaden 729
deadliness 729
deader 729
deadline 729
deadbeats 729
deadens 729
cheerfuller 729
deadened 729
deadliest 729
enjoyable 729
deadlock's 729
dead's 729
deadbolts 729
cheerfulness 729
deadlier 729
deadbolt's 729
deadbeat's 729
happy 729
deadwood 729
cheerfully 729
enjoyment's 729
deadpan's 729
deadbeat 729
deadbolt 729
deadliness's 729
cheerfullest 729
enjoyments 729
deadlock 729
enjoyment 729
deadpan 729
deadpanned 729
dead 729
enjoy 729
deadest 729
deadpanning 729
deadly 729
enjoys 729
slaphappy 729
unhappy 729
deadlocks 729
deadlines 729
deadpans 729
deadening 729
enjoyed 729
deadlocked 729
deadwood's 729
cheerfulness's 729
deadline's 729
enjoying 729
deadlocking 729
cheerful 729
real 0m46.817s
user 0m46.720s
sys 0m0.228s
awk, but simplified: since we know the file's structure is one word per line, we can avoid the regex matching.
$ time awk -v w="happy enjoy dead cheerful" '
BEGIN {n=split(w,a); for (i=1; i<=n; i++) words[a[i]]=1}
$1 in words {count[$1]++}
END {for (word in count) print count[word], word}
' words
729 cheerful
729 enjoy
729 happy
729 dead
real 0m13.781s
user 0m13.652s
sys 0m0.164s
Would direct string-equality comparisons be faster, since the list of "needle" words is short?
$ time awk '
$1 == "happy" || $1 == "enjoy" || $1 == "dead" || $1 == "cheerful" {count[$1]++}
END {for (word in count) print count[word], word}
' words
729 cheerful
729 enjoy
729 happy
729 dead
real 0m32.738s
user 0m32.668s
sys 0m0.156s
No. Apparently the in operator is fast.
Surprisingly (to me), grepping the file multiple times is still quite fast:
$ time sh -c 'for i in happy enjoy dead cheerful; do echo "$(grep -cFx "$i" words) $i"; done'
729 happy
729 enjoy
729 dead
729 cheerful
real 0m2.480s
user 0m2.132s
sys 0m0.348s
In any case, the grep|sort|uniq pipeline is the fastest so far.
New winner: grepping the file multiple times, but with different options:
$ time sh -c 'for i in happy enjoy dead cheerful; do echo "$(grep -cw "$i" words) $i"; done'
729 happy
729 enjoy
1458 dead
729 cheerful
real 0m1.708s
user 0m1.348s
sys 0m0.356s
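Note the dead count in this last run: 1458 instead of 729. With -w, a line like dead's also matches, because the word "dead" ends at the apostrophe (a non-word character), whereas -Fx only counts lines that are exactly the word. A small demonstration (sample file name assumed):

```shell
printf "dead\ndead's\n" > sample
grep -cw dead sample    # word-boundary match: both lines count -> 2
grep -cFx dead sample   # fixed-string whole-line match -> 1
```

So the fastest variant and the exact-match variants are not answering quite the same question.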