Question

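Is it possible to write a one-line grep expression that finds lines containing three occurrences of the same word? Note that we don't know the word in advance. The following snippet catches most cases: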

$ grep -E '(\w+)[[:space:]]+\1[[:space:]]+\1' test_data.txt

However, it does not catch the following positive example:

lunch dinner dinner dinner lunch

Also note that we are looking only for whole words, not just repeated characters. So a negative example would be:

他采摘他鲜花他重新

Edit (thanks to @lev-levitsky):

The positive example above is in fact caught, but the following one is not:

lunch lunch dinner dinner lunch
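
To make the gap concrete, the pattern from the question can be run against both sample lines. This is only a minimal sketch: it assumes GNU grep (for `\w` and back-references under `-E`), and the variable names `pattern`, `adjacent` and `scattered` are illustrative, not part of the original question.

```shell
# The pattern from the question: three copies separated only by whitespace.
pattern='(\w+)[[:space:]]+\1[[:space:]]+\1'

# grep -c prints the number of matching lines; "|| true" keeps the script
# going when that number is 0 (grep then exits with status 1).
adjacent=$(printf '%s\n' 'lunch dinner dinner dinner lunch' | grep -cE "$pattern" || true)
scattered=$(printf '%s\n' 'lunch lunch dinner dinner lunch' | grep -cE "$pattern" || true)

echo "adjacent=$adjacent scattered=$scattered"
```

The first line matches because the three copies of "dinner" are adjacent. The second does not, because the pattern allows nothing but whitespace between the repeated words.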

Answer 1

This should work for you:

grep -E "[[:<:]](\w+)[[:>:]].*[[:<:]]\1[[:>:]].*[[:<:]]\1[[:>:]]" testfile

For example:

paul@horus:~/src/sandbox$ cat testfile
how is summer summer summer ha ha
this summer is a hot summer of summers yes it is
summer summer summer
there is only one summer in this sentence
summer appears as the first and last summer words in this summer
the summertime is always in summer, one of several summers
the summer of which we speak is summery but is a real summer summer, yes
this also works with cats, since there are three cats in these cats, ha!
paul@horus:~/src/sandbox$ grep -E "[[:<:]](\w+)[[:>:]].*[[:<:]]\1[[:>:]].*[[:<:]]\1[[:>:]]" testfile
how is summer summer summer ha ha
summer summer summer
summer appears as the first and last summer words in this summer
the summer of which we speak is summery but is a real summer summer, yes
this also works with cats, since there are three cats in these cats, ha!
paul@horus:~/src/sandbox$

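[[:<:]] and [[:>:]] match the empty string at the beginning and end of a word, respectively, so you can use them to find word boundaries without having to assume that words are separated by whitespace rather than by punctuation and the like.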
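
The [[:<:]] and [[:>:]] spellings come from BSD-style regex libraries (for example the grep shipped with macOS), and GNU grep may reject them. As a hedged sketch, the same idea can be written for GNU grep with its `\<` and `\>` word-boundary operators. The temporary file and its contents below are hypothetical, built from example lines that appear in this thread.

```shell
# Hypothetical sample file; the lines reuse examples from the question and answer.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
lunch lunch dinner dinner lunch
lunch dinner dinner dinner lunch
there is only one summer in this sentence
the summertime is always in summer, one of several summers
EOF

# GNU grep spelling of the same pattern: \< and \> mark the start and end
# of a word, so the captured group is always a complete word.
matches=$(grep -E '\<(\w+)\>.*\<\1\>.*\<\1\>' "$tmpfile")
printf '%s\n' "$matches"

rm -f "$tmpfile"
```

Only the first two lines should be printed: "lunch" and "dinner" each occur three times as whole words, while "summertime" and "summers" do not count as occurrences of "summer".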

Answer 2

This is neither grep nor a regex, but it may work:

awk -F"[,. \t]*" '{for (i=1;i<=NF;i++) {if (++a[$i]==3) {printf "%s ",$i;f=1}} if (f) print "";f=0;delete a}' file

It counts the words on each line and, whenever a word occurs three or more times, prints that word for the line.
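
The same counting idea can be spelled out over several lines. The following is a sketch rather than the answerer's exact script, with three deliberate changes: it prints the whole matching line (as grep would) instead of the repeated words, it uses `+` rather than `*` in the field separator so that the separator can never match an empty string, and it clears the array with `split("", count)`. The sample input and the names `count`, `found` and `result` are hypothetical.

```shell
# Hypothetical input, fed on stdin so the sketch is self-contained.
result=$(printf '%s\n' \
  'summer summer summer' \
  'this also works with cats, since there are three cats in these cats, ha!' \
  'there is only one summer in this sentence' |
awk -F'[,. \t]+' '{
  found = 0
  split("", count)            # portable way to empty the array for each line
  for (i = 1; i <= NF; i++)   # count every non-empty field on this line
    if ($i != "" && ++count[$i] == 3) { found = 1; break }
  if (found) print            # print the whole line, as grep would
}')
printf '%s\n' "$result"
```

Here the first two input lines are printed and the third is not, since no word in it reaches a count of three. Separators other than commas, periods, spaces and tabs would need to be added to the `-F` bracket expression.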

Is it possible to use grep to find lines with multiple occurrences of the same word?