Question

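I asked a similar question before, but people misunderstood what I was asking. What I want is a list of every word in which each word's count goes up by at most one per file.

Say I have a directory containing 10 files. I would like to use bash commands to generate a word list in which every word has a value from 1 to 10, according to how many files it appears in.

For example: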

10 The
10 and
8 bash
7 command
6 help....
etc.

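I already know that grep -l word * | wc -l counts the files containing a single word, but I want to build a list covering all words.

Is there a way to combine it with tr '[A-Z]' '[a-z]' | tr -d '[:punct:]', so that capitalized words are not counted separately and punctuation is removed?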

Answer 1

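Do the following for each file:

Remove punctuation: tr -d '[:punct:]'
Convert to lowercase and put one word per line: tr 'A-Z ' 'a-z\n'
Remove duplicate words: sort -u

Then concatenate all of those results and count the occurrences of each word: sort | uniq -c

So the complete script looks like this: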

for file in *; do
    tr -d '[:punct:]' < "$file" | tr '[A-Z] ' '[a-z]\n' | sort -u
done | sort | uniq -c
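
As a usage sketch of the loop above, the same steps can be wrapped in a function that takes the files as arguments. The function name doc_freq is hypothetical. Three things differ from the script above and are my own assumptions about what is wanted: non-regular files are skipped, words are split on any whitespace (tabs and newlines as well as spaces), and the result is sorted so that the most widespread words come first, as in the question's sample output.

```shell
# doc_freq FILE...
# Prints "<number of files containing the word> <word>", highest count first.
# The name doc_freq is hypothetical; it is not part of the original answer.
doc_freq() {
    local file
    for file in "$@"; do
        [ -f "$file" ] || continue        # skip directories and other non-files
        tr -d '[:punct:]' < "$file" |     # remove punctuation
            tr 'A-Z' 'a-z' |              # fold to lowercase
            tr -s '[:space:]' '\n' |      # one word per line, squeezing whitespace runs
            sed '/^$/d' |                 # drop the empty line left by leading whitespace
            sort -u                       # keep each word at most once per file
    done | sort | uniq -c | sort -k1,1nr -k2
}
```

Called as doc_freq *, it reads every regular file in the current directory.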

Answer 2

If these are our files

$ cat file1
hello world
$ cat file2
the quick brown 
fox etc
$ cat file3
HELLO BROWN FOX

then

grep -o '[[:alpha:]]\+' * | sed 's/:.*/\L&/' | sort -u | cut -d: -f2 | sort | uniq -c
      2 brown
      1 etc
      2 fox
      2 hello
      1 quick
      1 the
      1 world

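grep - extracts sequences of alphabetic characters and prefixes each word with the filename and a colon
sed - lowercases the word but not the filename (so that "file1" and "File1" stay distinct)
sort -u - so that each word appears only once per file
cut - removes the filename from the output
sort | uniq -c - does the counting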
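
The pipeline above relies on two GNU extensions: \+ in a basic regular expression and \L in the sed replacement. It also depends on grep printing the filename prefix, which it only does by default when given more than one file. A minimal sketch that avoids those three points is shown here. The function name doc_freq_grep is hypothetical, the -o and -H options are assumed to be available (they are in GNU and BSD grep but are not part of POSIX), and filenames are assumed to contain no colon.

```shell
# doc_freq_grep FILE...
# Same stages as the pipeline above, with awk doing the lowercasing.
# The name doc_freq_grep is hypothetical. Assumes no ':' in any filename.
doc_freq_grep() {
    grep -oHE '[[:alpha:]]+' "$@" |                    # "filename:word", one per line
        awk -F: '{ print $1 ":" tolower($2) }' |       # lowercase the word, not the filename
        sort -u |                                      # one line per (file, word) pair
        cut -d: -f2 |                                  # drop the filename
        sort | uniq -c                                 # count files per word
}
```

Because -H forces the filename prefix, the function also works when it is called with a single file.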

Answer 3

An awk solution.

awk '
# Clear the "a" array for each new file.
FNR==1 {split("", a)}

{
    # Remove all punctuation.
    gsub(/[[:punct:]]+/, "")

    # Walk over each field.
    for (i=1;i<=NF;i++) {
        # Lowercase each word.
        word=tolower($i)

        # If we have not yet seen this word in this file then add it to our count.
        if (!a[word]) {
            words[word]++
        }

        # Store that we have now seen this word in this file.
        a[word]++
    }
}

END {
    # Loop over all the words and print out the counts.
    for (word in words) {
        print word, words[word]
    }
}' *
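
A variant of the script above, offered as a sketch rather than as the answer's own method: instead of clearing an array at the start of each file, it keys a single array on the (filename, word) pair. It also prints the count first and sorts the result, which is my assumption about the wanted output based on the question's sample. The function name doc_freq_awk and the array names seen and count are hypothetical.

```shell
# doc_freq_awk FILE...
# Prints "<number of files containing the word> <word>", highest count first.
# The name doc_freq_awk is hypothetical.
doc_freq_awk() {
    awk '
    {
        gsub(/[[:punct:]]+/, "")          # remove punctuation; awk re-splits the fields
        for (i = 1; i <= NF; i++) {
            word = tolower($i)
            # seen[FILENAME, word] is 0 only the first time this pair occurs,
            # so each file contributes at most 1 to count[word].
            if (!seen[FILENAME, word]++) count[word]++
        }
    }
    END {
        for (word in count) print count[word], word
    }' "$@" | sort -k1,1nr -k2
}
```

Called as doc_freq_awk *, it covers every file in the current directory.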

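A lua solution (which gets per-file counts as well as total counts). (You could do this in awk too, but because awk arrays are not two-dimensional it needs more loops.)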

local fmap = {}
local wmap = {}

for _, file in ipairs(arg) do
    if file ~= arg[0] then
        for line in io.lines(file) do
            line = line:gsub("%p*", "")
            line = line:gsub("%u*", string.lower)
            for word in line:gmatch("%w+") do
                fmap[file] = fmap[file] or {}
                fmap[file][word] = (fmap[file][word] or 0) + 1

                wmap[word] = (wmap[word] or 0) + 1
            end
        end
    end
end
print("# count by word")
for word, count in pairs(wmap) do
    print(count, word)
end
for file, wtab in pairs(fmap) do
    print("# count by word by file for "..file)
    for word, count in pairs(wtab) do
        print(count, word)
    end
end
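
For comparison with the remark about awk, here is a rough sketch of the same per-file and total counts in awk. It simulates the two-dimensional table with a combined subscript: awk joins the parts of per_file[FILENAME, word] with the built-in SUBSEP, and the key has to be split apart again when printing. The function name word_counts_awk and the array names are hypothetical, and the output layout (count, word, filename on one line) is my own choice, not the Lua script's.

```shell
# word_counts_awk FILE...
# Prints total occurrences per word, then occurrences per word per file.
# The name word_counts_awk is hypothetical.
word_counts_awk() {
    awk '
    {
        gsub(/[[:punct:]]+/, "")              # remove punctuation
        for (i = 1; i <= NF; i++) {
            word = tolower($i)
            total[word]++                     # every occurrence, across all files
            per_file[FILENAME, word]++        # key is FILENAME SUBSEP word
        }
    }
    END {
        print "# count by word"
        for (word in total) print total[word], word
        print "# count by word by file"
        for (key in per_file) {
            split(key, part, SUBSEP)          # part[1] = filename, part[2] = word
            print per_file[key], part[2], part[1]
        }
    }' "$@"
}
```

The order of lines within each section is unspecified, so pipe the output through sort if a stable order matters.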

How to count the occurrences of all words in all files of a directory, with the count incremented only once per word per file?
