Word occurrence script

Time: 2014-10-31 02:41:16

Tags: awk

I am writing a script that counts the occurrences of each word in a text document.

{
        $0 = tolower($0)
        for ( i = 1; i <= NF; i++ )
                freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count"}
END {
        sort = "sort -k 2nr"
        for (word in freq)
                printf "%-20s %-6s\n", word, freq[word] | sort
        close(sort)
}

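So far it works fine, but I would like to make a few adjustments/additions:

  1. I am having trouble displaying the array index number; I tried freq[$i], but it just spat out 0s.
  2. Is there a way to exclude whitespace (blanks) from the word count?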

1 answer:

Answer 0: (score: 1)

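You don't need to write your own loop to scan the fields. Just set RS so that each word becomes its own record: for example, RS="[^A-Za-z]+" treats every run of characters other than upper- and lowercase letters as a record separator. (A regular-expression RS works in gawk and mawk; POSIX leaves the behavior of a multi-character RS unspecified.)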

$ echo 'Hello world! I am happy123...' | awk 'BEGIN{RS="[^A-Za-z]+"}$0'
Hello
world
I
am
happy

The bare pattern $0 matches only non-empty records, so empty records are skipped.
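To see why the filter matters, here is a minimal sketch, assuming an awk that accepts a regex RS (gawk or mawk). The function name `words` is made up for illustration. When the input starts with a separator, the first record is empty, and the `$0` pattern drops it.

```shell
# Hypothetical helper: print one word per line, skipping empty records.
words() {
    awk 'BEGIN { RS = "[^A-Za-z]+" } $0'
}

# The leading "..." matches RS, so the first record is empty;
# the $0 pattern keeps it out of the output.
printf '...Hello, world!\n' | words
# Hello
# world
```

Without the `$0` pattern, that empty record would be printed (or, in the counting script, counted) as an empty word.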

Maybe you want to allow digits inside words. Just adjust RS to suit your needs.
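For instance, a sketch that also accepts digits could look like the following. It assumes an awk with regex RS support, and `words_with_digits` is a made-up name. One caveat: once digits are allowed, a token such as `0` looks numeric and is false as a bare `$0` pattern, so this version tests the length instead.

```shell
# Hypothetical helper: words may contain letters and digits.
words_with_digits() {
    # length($0) > 0 keeps a literal "0" token, which a bare $0 pattern would drop.
    awk 'BEGIN { RS = "[^A-Za-z0-9]+" } length($0) > 0'
}

printf 'Hello world! I am happy123...\n' | words_with_digits
# Hello
# world
# I
# am
# happy123
```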

So what is left?

Convert to lowercase, count, and print the sorted results.

File wfreq.awk

BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
        printf "%-20s %6s\n", "Word", "Count"
        sort = "sort -k 2nr"
        for(word in counts)
                printf "%-20s %6s\n",word,counts[word] | sort
        close(sort)
}

Sample run (showing only the first 10 lines of output, so as not to spam the answer):

$ awk -f wfreq.awk /etc/motd | head
Word                  Count
the                       5
debian                    3
linux                     3
are                       2
bpo                       2
gnu                       2
in                        2
with                      2
absolutely                1

And now for something not completely different...

To sort by a different field, just adjust the options in sort = "sort ...".
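As one possible adjustment, the sketch below sorts by count in descending order and breaks ties alphabetically by word. It is a variation, not the script above: `word_counts` is a made-up helper, the header is left out, and sort runs in the shell pipeline rather than inside awk. `LC_ALL=C` is set only to make the tie order predictable.

```shell
# Hypothetical helper: print unsorted "word count" lines.
word_counts() {
    awk 'BEGIN { RS = "[^A-Za-z]+" }
         $0 { counts[tolower($0)]++ }
         END { for (word in counts) printf "%-20s %6d\n", word, counts[word] }'
}

# -k2,2nr: field 2 numerically, descending; -k1,1: then field 1 ascending.
printf 'the cat and the hat and the bat\n' |
    word_counts | LC_ALL=C sort -k2,2nr -k1,1
```

Here `the` (3) comes first, then `and` (2), then `bat`, `cat` and `hat` (1 each) in alphabetical order.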

I don't use asort() because it is a gawk extension and not every awk has it.
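If you would rather not depend on either asort() or an external sort, a small hand-written sort inside awk is one option. The following is only a sketch under that assumption: `sorted_counts` and `before` are made-up names, it still relies on a regex RS, and the insertion sort is fine for modest vocabularies but slow for very large ones.

```shell
# Hypothetical helper: count words and sort them inside awk itself.
sorted_counts() {
    awk '
    BEGIN { RS = "[^A-Za-z]+" }
    $0 { counts[tolower($0)]++ }
    END {
        # Copy the words into an indexed array so they can be ordered.
        n = 0
        for (word in counts) keys[++n] = word
        # Insertion sort: higher count first, ties in alphabetical order.
        for (i = 2; i <= n; i++) {
            k = keys[i]
            for (j = i - 1; j >= 1 && before(k, keys[j]); j--)
                keys[j + 1] = keys[j]
            keys[j + 1] = k
        }
        for (i = 1; i <= n; i++)
            printf "%-20s %6d\n", keys[i], counts[keys[i]]
    }
    # True when word a should be listed before word b.
    function before(a, b) {
        if (counts[a] != counts[b]) return counts[a] > counts[b]
        return a < b
    }'
}

printf 'b a b c a b\n' | sorted_counts
```

With this input, `b` (3) is listed first, then `a` (2), then `c` (1).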

File wfreq2.awk

BEGIN { RS="[^A-Za-z]+" }
$0 { counts[tolower($0)]++ }
END{
        printf "%-20s %6s\n", "Word", "Count"
        sort = "sort -k 1"
        for(word in counts)
                printf "%-20s %6s\n",word,counts[word] | sort
        close(sort)
}

Sample run (showing only the first 10 lines of output, so as not to spam the answer):

$ awk -f wfreq2.awk /etc/motd | head
Word                  Count
absolutely                1
amd                       1
applicable                1
are                       2
bpo                       2
by                        1
comes                     1
copyright                 1
darkstar                  1