UNIX - Using egrep, how do I filter for a pattern that occurs n times?

Date: 2016-09-28 21:19:07

Tags: linux unix grep

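The title says it all: I need to filter a file with egrep according to a specification, and the part I can't figure out is how to make sure the pattern occurs three times. (Direct wording from the problem: words containing 5 or more characters that occur at least three times in the line.)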

3 answers:

Answer 0: (score: 1)

egrep '([a-zA-Z]{5}).*\1.*\1'

This worked in my quick tests, but I'm not sure how robust it is.

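\1 (and \2, \3, ...) are backreferences. I put () around the five-letter pattern [a-zA-Z]{5}, which makes it the first capture group. \1 then tells the regex to look for a repeat of the exact text that was matched inside that first ( ) group.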

Finally, there is a .* between the three occurrences so that anything may appear between them.
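
To see the backreference at work, here is a small sketch that pipes three made-up lines through the pattern above. The sample lines are assumptions for illustration, not data from the question. It uses grep -E rather than egrep because recent GNU grep releases warn that egrep is obsolescent, and it relies on backreferences in extended regular expressions, which GNU grep supports but POSIX does not guarantee.

```shell
#!/bin/sh
# Made-up sample lines (not from the question).
sample='apple pie, apple tart, apple jam
apple pie, apple tart
pineapple, apple, applesauce'

# \1 must repeat the exact five letters captured by the group,
# so a line is printed only if that sequence occurs three times.
matched=$(printf '%s\n' "$sample" | grep -E '([a-zA-Z]{5}).*\1.*\1' || true)
printf '%s\n' "$matched"
```

The first and third lines should be printed. The third one gets through because nothing in the pattern requires word boundaries, so "apple" inside "pineapple" and "applesauce" counts as an occurrence.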

Answer 1: (score: 0)

Use (untested):

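awk '
  {
    split("", count)                      # reset the per-line tallies
    n = split($0, words, /[^a-zA-Z]+/)    # break the line into runs of letters
    for (i = 1; i <= n; i++) {
      # print the line once a word of 5+ letters is seen a third time
      if (length(words[i]) >= 5 && ++count[words[i]] == 3) {
        print
        break
      }
    }
  }
' file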

Answer 2: (score: 0)

$ cat ip.txt 
abc abc abc should not match
totally this line should totally match, isn't it? totally 
Title: word with 5 letters like title should also match, given title is present 3 or more times
this line should not totally match, total only partly matches with totally

Match whole words, case-sensitively:

$ grep -wE '([a-zA-Z]{5,}).*\1.*\1' ip.txt 
totally this line should totally match, isn't it? totally 

Match whole words regardless of case:

$ grep -iwE '([a-zA-Z]{5,}).*\1.*\1' ip.txt 
totally this line should totally match, isn't it? totally 
Title: word with 5 letters like title should also match, given title is present 3 or more times

Match any sequence of five or more letters:

$ grep -iE '([a-zA-Z]{5,}).*\1.*\1' ip.txt 
totally this line should totally match, isn't it? totally 
Title: word with 5 letters like title should also match, given title is present 3 or more times
this line should not totally match, total only partly matches with totally
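  • -E: extended regular expressions
  • -w: match whole words only
  • -i: ignore case
  • [a-zA-Z]{5,}: a lowercase or uppercase letter, five or more times
  • (): capture group; \1 is a backreference to it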
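
One subtlety with -w is that it only tests the two ends of the overall match, not each repeated occurrence. The sketch below illustrates this on a made-up line (an assumption, not part of ip.txt) and compares it with a stricter variant that wraps every occurrence in \< and \>, which are GNU extensions and may not exist in other grep implementations.

```shell
#!/bin/sh
# Made-up test line: "total" is a whole word only twice.
line='total totally total'

# -w checks just the start and end of the whole match, so the middle
# "total" found inside "totally" is accepted and the line is printed.
loose=$(printf '%s\n' "$line" | grep -wE '([a-zA-Z]{5,}).*\1.*\1' || true)

# Anchoring every occurrence with \< and \> rejects the line.
strict=$(printf '%s\n' "$line" | grep -E '\<([a-zA-Z]{5,})\>.*\<\1\>.*\<\1\>' || true)

printf 'loose:  [%s]\nstrict: [%s]\n' "$loose" "$strict"
```

If the requirement is that the same whole word appears three times, the stricter form is probably closer to the wording of the question.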
If you have PCRE regular expressions available (grep -P), here is a bit of fun:
$ echo 'totally title match' | grep -P '([a-zA-Z]{5,}).*(?1).*(?1)'
totally title match
  • (?1) refers to the regex pattern ([a-zA-Z]{5,}) itself, not to the text it matched, so the three words may differ
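
The difference between (?1) and \1 is easiest to see side by side. This is a minimal sketch that assumes a grep built with PCRE support; the availability check and the variable names are illustrative only.

```shell
#!/bin/sh
line='totally title match'

# Skip quietly if this grep was built without PCRE support.
if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
  # (?1) re-runs the group's pattern, so three different words are enough.
  recursed=$(printf '%s\n' "$line" | grep -P '([a-zA-Z]{5,}).*(?1).*(?1)' || true)
  # \1 needs the same text again, so this finds nothing.
  backref=$(printf '%s\n' "$line" | grep -P '([a-zA-Z]{5,}).*\1.*\1' || true)
  printf 'subroutine call: [%s]\nbackreference:   [%s]\n' "$recursed" "$backref"
else
  echo 'grep -P is not available here'
fi
```

So (?1) answers "are there three words of five or more letters?", while \1 answers the original question of whether the same one repeats.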