Question

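I'm not very familiar with Linux. I have a very large text file (a few gigabytes), and I'd like to find the most frequently used words (say, the top 50) along with a count of how many times each one occurs, and write those numbers to a text file, like this: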

2500 and

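How can I do this with Awk? (It doesn't have to be Awk specifically, but I'm using Cygwin on Windows 7 and I'm not sure what else is available for this kind of thing.)

I have looked at this question: https://unix.stackexchange.com/questions/41479/find-n-most-frequent-words-in-a-file

However, as mentioned, I'm not very familiar with Linux, pipes, and so on, so I'd appreciate it if someone could explain what each command does.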

Answer 1

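It all depends on your definition of "word". But if we assume it's a contiguous, case-insensitive sequence of alphabetic characters, then one way to do it with GNU awk (which is the awk you get with Cygwin) would be: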

awk -v RS='[[:alpha:]]+' '
    RT { cnt[tolower(RT)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in cnt) {
            print cnt[word], word
            if (++c == 50) {
                exit
            }
        }
    }
' file

Running the above on @dawg's Tale of Two Cities example outputs:

8230 the
5067 and
4140 of
3651 to
3017 a
2660 in
...
440 when
440 been
428 which
399 them
385 what

Want to exclude filler words of 1 or 2 characters, such as of, to, a, and in above? Easy:

awk -v RS='[[:alpha:]]+' '
    length(RT)>2 { cnt[tolower(RT)]++ }
    END {
        PROCINFO["sorted_in"] = "@val_num_desc"
        for (word in cnt) {
            print cnt[word], word
            if (++c == 50) {
                exit
            }
        }
    }
' pg98.txt
8230 the
5067 and
2011 his
1956 that
1774 was
1497 you
1358 with
....

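With other awks, it's a while(match()) substr() loop, with the output piped through sort -rn to head.

If that's not what you want, please edit your question to include some sample input and expected output so that we can help you.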

Answer 2

I created a file by copying an entire article. This awk one-liner might be a start.

awk -v RS="[[:punct:]]" '{for(i=1;i<=NF;i++) words[$i]++;}END{for (i in words) print words[i]" "i}' file

Part of the output:

 1 exploration
 1 day
 1 staggering
 1 these
 2 into
 1 Africans
 4 across
 5 The
 1 head
 1 parasitic
 1 parasitized
 1 discovered
 1 To
 1 both
 1 what
 1 As
 1 inject
 1 hypodermic
 1 succumbing
 1 glass
 1 picked
 1 Observatory
 1 actually

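Full version. I use two files: one containing English stop words, and another containing the text from which we want to extract the 50 most frequent words.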

BEGIN {
    FS="[[:punct:] ]";
}
FNR==NR{
    stop_words[$1]++;
    next;
}
{
    for(i=1;i<=NF;i++)
    {
        if (stop_words[$i])
        {
            continue;
        }

        if ($i ~ /[[:alpha:]]+/) # count the field only if it contains at least one letter (this also skips empty fields)
        {
            words[$i]++;
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (w in words)
    {
        count++;
        print words[w], w;
        if (count == 50)
        {
            break;
        }
    }
}

How to run it: awk -f script.awk english_stop_words.txt big_file.txt
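
For comparison, the same stop-word filtering can be sketched in Python. This is only an illustration and not part of the awk script above: the function name count_without_stop_words and the sample text are made up, and unlike the awk version it lowercases everything before comparing, so "The" and "the" are treated as the same word.

```python
import re
from collections import Counter


def count_without_stop_words(lines, stop_words, top_n=50):
    """Count alphabetic words in `lines`, skipping anything in `stop_words`.

    Hypothetical helper for illustration. `lines` and `stop_words` can be any
    iterables of strings, for example open file objects.
    """
    # Normalize the stop words once so each lookup is a cheap set membership test.
    stop = {w.strip().lower() for w in stop_words if w.strip()}
    counts = Counter()
    for line in lines:
        # Runs of ASCII letters stand in for splitting on punctuation and spaces.
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in stop:
                counts[word] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = ["The cat and the dog.", "The dog saw the bird!"]
    for word, n in count_without_stop_words(sample, ["the", "and"]):
        print(n, word)
```

Because the input is consumed one line at a time, memory use grows with the number of distinct words rather than with the size of the file.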

Answer 3

Here is a Python version:

from collections import Counter

wc=Counter()

with open('tale.txt') as f:
    for line in f:
        wc.update(line.split())

print(wc.most_common(50))

Running it on Tale of Two Cities produces:

[('the', 7514), ('and', 4745), ('of', 4066), ('to', 3458), ('a', 2825), ('in', 2447), ('his', 1911), ('was', 1673), ('that', 1663), ('I', 1446), ('he', 1388), ('with', 1288), ('had', 1263), ('it', 1173), ('as', 1016), ('at', 978), ('you', 895), ('for', 868), ('on', 820), ('her', 818), ('not', 748), ('is', 713), ('have', 703), ('be', 701), ('were', 633), ('Mr.', 602), ('The', 587), ('said', 570), ('my', 568), ('by', 547), ('him', 525), ('from', 505), ('this', 465), ('all', 459), ('they', 446), ('no', 423), ('so', 420), ('or', 418), ('been', 415), ('"I', 400), ('but', 387), ('which', 375), ('He', 363), ('when', 354), ('an', 337), ('one', 334), ('out', 333), ('who', 331), ('if', 327), ('would', 327)]

You can also use awk, sort, and head to come up with a modular, Unix-style solution:

$ awk '{for (i=1;i<=NF; i++){words[$i]++}}END{for (w in words) print words[w]"\t"w}' tale.txt | sort -n -r | head -n 50
7514    the
4745    and
4066    of
3458    to
2825    a
2447    in
...

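Regardless of the language, the recipe is the same:

Create an associative array of words and their frequency counts
Read the file line by line, adding each word to the associative array
Sort the array by frequency and print the desired number of entries.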
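
The three steps can be sketched with nothing but a plain dictionary. This is a minimal illustration, assuming whitespace-separated words; the function name top_words is hypothetical, and heapq.nlargest is swapped in for a full sort because only the top few entries are needed.

```python
import heapq


def top_words(lines, n=50):
    """Return the n most frequent whitespace-separated words as (word, count) pairs."""
    counts = {}  # step 1: associative array mapping word -> count
    for line in lines:  # step 2: read the input line by line
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    # step 3: select the n entries with the highest counts
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for word, count in top_words(["a b a", "b a c"], n=2):
        print(count, word)
```

Passing an open file object as lines (for example, the tale.txt file used above) keeps only one line plus the dictionary in memory at a time.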

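You also need to think about what a "word" is. In this case, I simply used whitespace as the separator, treating each block of non-space characters as a "word". That means And, and, and "And are all different words. Separating punctuation is an extra step, usually involving a regular expression.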
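
One way to handle that extra step is sketched below, assuming a "word" is a run of ASCII letters, optionally with internal apostrophes (as in don't), compared case-insensitively. The pattern and the function name normalized_counts are illustrative choices, not the only reasonable definition.

```python
import re
from collections import Counter

# Letters, optionally joined by internal apostrophes (assumed definition of a word).
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def normalized_counts(lines):
    """Count words after lowercasing and stripping surrounding punctuation."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


if __name__ == "__main__":
    counts = normalized_counts(['And and "And', "I don't know."])
    for word, n in counts.most_common(3):
        print(n, word)
```

With this definition, And, and, and "And all collapse into the single entry and.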

Using Awk to read a large text file and output a text file of the most common words?