I'm writing a script that counts how often each word occurs in a text document.
{
    $0 = tolower($0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count" }
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
    close(sort)
}
So far it works fine, but I'd like to make a few tweaks/additions:
Answer 0 (score: 1)
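You don't need to write your own loop to scan the fields. Just set RS so that each word becomes its own record. For example, RS="[^A-Za-z]+" treats every run of characters other than upper- and lowercase letters as a record separator.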
$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z]+"}$0'
Hello
world
I
am
happy
The bare $0 pattern prints only non-empty records.
You may want to allow digits in words; just adjust RS to suit your needs.
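As a minimal sketch of that adjustment, adding 0-9 to the bracket expression keeps digits inside words. This assumes an awk that treats a multi-character RS as a regular expression (gawk and mawk do; POSIX leaves the behavior unspecified), and the function name split_words is hypothetical. Note that the pattern changes from a bare $0 to length($0): once digits are allowed, a record consisting of just 0 would evaluate as false and be dropped silently.

```shell
# Hypothetical helper: print one word per line, allowing digits inside words.
# Assumes an awk with regex RS support (e.g. gawk or mawk).
split_words() {
    # length($0) is true for any non-empty record, including a lone "0"
    awk 'BEGIN { RS = "[^A-Za-z0-9]+" } length($0)'
}

echo 'Hello world! I am happy123... 0 errors' | split_words
```

Here happy123 stays in one piece, and the lone 0 is printed as a record of its own rather than being filtered out.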
So what's left?
Convert to lowercase, count, and print the sorted results.
File wfreq.awk:
BEGIN { RS = "[^A-Za-z]+" }
$0    { counts[tolower($0)]++ }
END {
    printf "%-20s %6s\n", "Word", "Count"
    sort = "sort -k 2nr"
    for (word in counts)
        printf "%-20s %6s\n", word, counts[word] | sort
    close(sort)
}
Sample run (only the first 10 lines of output, to keep the answer short):
$ awk -f wfreq.awk /etc/motd | head
Word                  Count
the                       5
debian                    3
linux                     3
are                       2
bpo                       2
gnu                       2
in                        2
with                      2
absolutely                1
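And now for something not completely different...
To sort by a different field, just adjust the options in sort = "sort ...".
I don't use asort() because not every awk has this extension.
File wfreq2.awk: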
BEGIN { RS = "[^A-Za-z]+" }
$0    { counts[tolower($0)]++ }
END {
    printf "%-20s %6s\n", "Word", "Count"
    sort = "sort -k 1"
    for (word in counts)
        printf "%-20s %6s\n", word, counts[word] | sort
    close(sort)
}
Sample run (only the first 10 lines of output, to keep the answer short):
$ awk -f wfreq2.awk /etc/motd | head
Word                  Count
absolutely                1
amd                       1
applicable                1
are                       2
bpo                       2
by                        1
comes                     1
copyright                 1
darkstar                  1
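The sort options can also combine both orderings in one call. The following is only a sketch, assuming a POSIX sort and an awk with regex RS support: it sorts by count in descending order and breaks ties alphabetically. The function name wfreq_sorted is hypothetical, and the header line is left out to keep the example short.

```shell
# Hypothetical helper: word frequencies, highest count first, ties alphabetical.
# Assumes an awk with regex RS support (e.g. gawk or mawk) and a POSIX sort.
wfreq_sorted() {
    awk '
        BEGIN { RS = "[^A-Za-z]+" }
        $0    { counts[tolower($0)]++ }
        END {
            # -k2,2nr: second field, numeric, reversed; -k1,1: then first field
            sort = "sort -k2,2nr -k1,1"
            for (word in counts)
                printf "%-20s %6s\n", word, counts[word] | sort
            close(sort)
        }
    ' "$@"
}

printf 'b a b c a b\n' | wfreq_sorted
```

Restricting each key to a single field (-k2,2 rather than -k 2) stops the first key from running to the end of the line, so the second key decides the order among words with equal counts.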