Question

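I wrote a Bash script that extracts words from a text file with grep and sed, sorts them with sort, counts the repetitions with wc, and then sorts again by frequency. The sample output looks like this: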

12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy

Now I would like to merge all words that have the same frequency onto a single line, like this:

12 the
 7 code with add
 5 quite
 3 do well
 1 quick can pick easy

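Is there a way to do this with Bash and the standard Unix toolset? Or would I have to write a script or program in a more sophisticated scripting language?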

Answer 1

With awk:

$ echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" | awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} END {for (e in cnt) print e, cnt[e]} ' | sort -nr
12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

You could do something similar with Bash 4 associative arrays, but awk is easier and is POSIX. Use that.
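For comparison, here is a minimal sketch of the Bash 4 associative-array variant mentioned above. It assumes Bash 4 or later (for `declare -A` / `local -A`) and that each input line is a count followed by a single word; `merge_counts` is a hypothetical helper name, not something from the original answer.

```shell
#!/usr/bin/env bash
# Sketch only: requires Bash 4+ for associative arrays.
# merge_counts is a hypothetical name chosen for illustration.
merge_counts() {
    local -A words=()      # maps count -> space-separated list of words
    local count word
    while read -r count word; do
        [[ -z $count ]] && continue    # skip blank lines
        # ${var:+...} prepends "old value + space" only when a value already exists
        words[$count]="${words[$count]:+${words[$count]} }$word"
    done
    # Associative-array keys come back in no particular order, so sort at the end
    for count in "${!words[@]}"; do
        printf '%s %s\n' "$count" "${words[$count]}"
    done | sort -nr
}

merge_counts <<'EOF'
12 the
 7 code
 7 with
 7 add
 5 quite
EOF
```

It is noticeably longer than the awk one-liner, which is the point of the recommendation above.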

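Explanation:

awk splits each line on the separators in FS, in this case the default of horizontal whitespace;
$1 is the first field, the count. It is used to collect items with the same count in an associative array keyed by the count, cnt[$1];
cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2 is a ternary assignment. If cnt[$1] has no value yet, the second field $2 is simply assigned to it (the right-hand side of the :). If it already has a value, $2 is appended to it, separated by the value of OFS (the left-hand side of the :);
At the end, the contents of the associative array are printed.

Since awk associative arrays are unordered, the output has to be sorted again by the numeric value of the first column. gawk can sort internally, but calling sort is just as easy. The input to awk does not need to be sorted, so you can drop that part of your pipeline.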
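To illustrate the last point, here is a small sketch that feeds the same awk program deliberately shuffled input. The function name `merge_by_count` is made up for this example; the awk program itself is the one from the answer.

```shell
# merge_by_count is a hypothetical wrapper around the awk program above
merge_by_count() {
    awk '{cnt[$1] = cnt[$1] ? cnt[$1] OFS $2 : $2}
         END {for (e in cnt) print e, cnt[e]}' | sort -nr
}

# Input is intentionally not sorted by count
printf '%s\n' ' 3 do' '12 the' ' 7 code' ' 3 well' ' 7 with' | merge_by_count
```

The lines still come out grouped and ordered by count; within each line the words keep the order in which they appeared in the input.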

If you want the numbers right-justified (as in your example):

$ awk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
     END {for (e in cnt) printf "%3s %s\n", e, cnt[e]} '

If you want gawk to sort numerically by descending index, set PROCINFO["sorted_in"]="@ind_num_desc" before traversing the array:

$ gawk '{cnt[$1]=cnt[$1] ? cnt[$1] OFS $2 : $2} 
            END {PROCINFO["sorted_in"]="@ind_num_desc"
               for (e in cnt) printf "%3s %s\n", e, cnt[e]} '

Answer 2

With a single GNU awk expression (no sort pipeline):

awk 'BEGIN{ PROCINFO["sorted_in"]="@ind_num_desc" }
     { a[$1]=(a[$1])? a[$1]" "$2:$2 }END{ for(i in a) print i,a[i]}' file

Output:

12 the
7 code with add
5 quite
3 do well
1 quick can pick easy

Bonus alternative solution, using the GNU datamash tool:

datamash -W -g1 collapse 2 <file

Output (the collapsed fields are comma-separated):

12  the
7   code,with,add
5   quite
3   do,well
1   quick,can,pick,easy

Answer 3

awk:

awk '{a[$1]=a[$1] FS $2}!b[$1]++{d[++c]=$1}END{while(i++<c)print d[i],a[d[i]]}' file

sed:

sed -r ':a;N;s/(\b([0-9]+).*)\n\s*\2/\1/;ta;P;D'

Answer 4

You are starting with sorted data, so you only need a new line when the first field changes.

echo "12 the
 7 code
 7 with
 7 add
 5 quite
 3 do
 3 well
 1 quick
 1 can
 1 pick
 1 easy" |
awk '
   {
      if ($1==last) { 
         printf(" %s",$2) 
      } else { 
         last=$1;
         printf("%s%s",(NR>1?"\n":""),$0)
      }
    }; END {print}'

Answer 5

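Next time you find yourself trying to manipulate text with a combination of grep and sed and shell and so on, stop and just use awk instead. The end result will be clearer, simpler, more efficient, more portable, etc.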

$ cat file
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness.

$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<NF; i++) {
        word2cnt[tolower($i)]++
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        # printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
  4 was
  4 the
  4 of
  4 it
  2 times
  2 age
  1 worst
  1 wisdom
  1 foolishness
  1 best

$ cat tst.awk
BEGIN { FS="[^[:alpha:]]+" }
{
    for (i=1; i<NF; i++) {
        word2cnt[tolower($i)]++
    }
}
END {
    for (word in word2cnt) {
        cnt = word2cnt[word]
        cnt2words[cnt] = (cnt in cnt2words ? cnt2words[cnt] " " : "") word
        # printf "%3d %s\n", cnt, word
    }
    for (cnt in cnt2words) {
        words = cnt2words[cnt]
        printf "%3d %s\n", cnt, words
    }
}
$
$ awk -f tst.awk file | sort -rn
  4 it was of the
  2 age times
  1 best worst wisdom foolishness

Just uncomment whichever of the printf lines in the above script you like to get whatever type of output you want. The above will work in any awk on any UNIX system.

Answer 6

Using Miller's nest verb:

mlr -p  nest --implode --values --across-records -f 2 --nested-fs ' ' file

Output:

12 the
7 code with add
5 quite
3 do well
1 quick can pick easy
