I use cat test.txt | grep -o -E '\w+' | sort | uniq -c | sort -nr | head -10
to find the 10 most frequent words, but can you help me find the 10 most frequent two-word pairs?
Answer 0 (score: 0)
Here is one for GNU awk:
$ awk '
BEGIN {
    FS="[,.]? "                              # split on a space, optionally preceded by , or . (add more punctuation as needed)
}
{
    for(i=1;i<NF;i++)                        # loop over all words in the record
        a[tolower($i OFS $(i+1))]++          # store each lowercased word pair and increment its count
}
END {
    PROCINFO["sorted_in"]="@val_num_desc"    # traverse the array by value, numerically descending (GNU awk only)
    for(i in a) {                            # loop over the pairs
        print i,a[i]                         # print the pair and its count
        if(++j==10)                          # stop after the top 10
            exit
    }
}' lorem_ipsum.txt                           # some sample text
Output:
sit amet 6
ac ultricies 2
tellus donec 2
sed odio 2
sagittis quis 2
est duis 2
vitae luctus 2
donec eu 2
nec tincidunt 2
nullam nec 2
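PROCINFO["sorted_in"] is specific to GNU awk, so the script above will not order its output under other awk implementations. As a rough sketch of a portable alternative, the counting can stay in awk while the ordering is handed to sort and head. The function name top_pairs is made up for this example, and it prints the count first (like uniq -c) so that sort can key on the first field:

```shell
# top_pairs FILE [N]: print the N most frequent adjacent word pairs (default 10).
# Hypothetical helper name; relies only on POSIX awk, sort and head.
top_pairs() {
    awk '
    BEGIN { FS = "[,.]? " }                  # same separator as the GNU awk answer
    {
        for (i = 1; i < NF; i++)             # every adjacent pair within the record
            count[tolower($i OFS $(i+1))]++
    }
    END {
        for (pair in count)                  # unordered here; sort does the ranking
            print count[pair], pair
    }' "$1" |
    sort -k1,1nr -k2 |                       # count descending, then pair text
    head -n "${2:-10}"
}
```

It would be called as top_pairs lorem_ipsum.txt, or top_pairs lorem_ipsum.txt 20 for a longer list. Like the original, it only counts pairs within a single line.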
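If word order should not matter (that is, "this other" counts the same as "other this"), replace
a[tolower($i OFS $(i+1))]++
with
a[(tolower($i)<tolower($(i+1))?tolower($i OFS $(i+1)):tolower($(i+1) OFS $i))]++
so that each pair is stored with its two words in alphabetical order. The words are lowercased before they are compared, so the ordering does not depend on capitalization.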
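The order-insensitive counting can also be sketched as a small standalone function. This is only an illustration: the names unordered_pairs and key are invented here, and the ranking is delegated to sort rather than to GNU awk's PROCINFO. Both words are lowercased before comparison so that "The cat" and "cat the" end up under the same key:

```shell
# unordered_pairs FILE: count adjacent word pairs, ignoring word order and case.
# Hypothetical helper; prints "count pair" lines, most frequent first.
unordered_pairs() {
    awk '
    function key(a, b) {                     # canonical key: words in alphabetical order
        a = tolower(a); b = tolower(b)       # tolower() returns strings, so < compares as text
        return (a < b) ? a OFS b : b OFS a
    }
    BEGIN { FS = "[,.]? " }
    { for (i = 1; i < NF; i++) count[key($i, $(i+1))]++ }
    END { for (pair in count) print count[pair], pair }' "$1" |
    sort -k1,1nr -k2                         # count descending, then pair text
}
```

Piping the result through head -10 gives the top 10, as in the question.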