Finding repeated-word typos in text with a regular expression

Time: 2013-01-08 14:11:38

Tags: regex

Is it possible to find all repeated-word typos in a text (in my case, a LaTeX source), such as:

... The Lagrangian that that includes this potential ...
... This is confimided by the the theorem of ...

using a regular expression?

Use your favourite tool (sed, grep) or language (python, perl, ...).

4 answers:

Answer 0: (score: 1)

Use egrep -w with a backreference, via the regex (\w+)\s+\1:

$ echo "The Lagrangian that that includes this potential" | egrep -ow "(\w+)\s+\1"
that that

$ echo "This is confimided by the the theorem of" | egrep -ow "(\w+)\s+\1"
the the

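Note: the -o option prints only the matching part of each line, which is useful for demonstrating what is actually matched; you may want to drop it and use --color instead. The -w option is important because it restricts matches to whole words; without it, the is is inside This is con... would also match.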

(\w+) # Matches & captures one or more word characters ([A-Za-z0-9_])
\s+   # Matches one or more whitespace characters
\1    # Backreference: the same text that group 1 captured

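Running egrep -w --color "(\w+)\s+\1" file has the benefit of clearly highlighting potentially erroneous repeated words. Replacing them automatically may be unwise, since many legitimate repetitions (such as reggae reggae sauce or beautiful beautiful day) would be changed.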
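For anyone who prefers a script to a shell one-liner, the same "report, don't replace" idea can be sketched in Python. This is only a minimal sketch: the helper name report_doubled_words and the sample lines are made up for illustration, and \b is used as a rough stand-in for what -w does in egrep.

```python
import re

# Same pattern as the egrep example; \b roughly plays the role of -w.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

def report_doubled_words(lines):
    """Return (line_number, matched_text) pairs; the input is never modified."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in DOUBLED.finditer(line):
            hits.append((number, match.group(0)))
    return hits

# Illustrative input: two typos from the question plus one legitimate repetition.
sample = [
    "The Lagrangian that that includes this potential",
    "This is confimided by the the theorem of",
    "What a beautiful beautiful day",
]

for number, text in report_doubled_words(sample):
    print(f"{number}: {text}")
```

The third line is reported as well, which is the point: a human decides whether each hit is a typo or an intended repetition.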

Answer 1: (score: 1)

This JavaScript example works:

var s = '... The Lagrangian that that includes this potential ... This is confimided by the the theorem of ...'
var result = s.match(/\b(\w+)\s\1\b/gi)

Result:

["that that", "the the"];

Regex:

/\b(\w+)\s\1\b/gi

# /     --> Regex start,
# \b    --> A word boundary,
# (\w+) --> Followed by a word, grouped,
# \s    --> Followed by a space,
# \1    --> Followed by the word in group 1,
# \b    --> Followed by a word boundary,
# /gi   --> End regex, (g)lobal flag, case (i)nsensitive flag.

The word boundaries are there to prevent the regex from matching strings such as "hot hotel" or "nice ice".
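To see what the word boundaries buy you, here is a small Python comparison of the pattern with and without \b. It is a sketch for illustration only: the helper matched_text and the variable names are invented here, and Python's re module is assumed to behave like the JavaScript engine for this particular pattern.

```python
import re

with_boundaries = re.compile(r"\b(\w+)\s\1\b", re.IGNORECASE)
without_boundaries = re.compile(r"(\w+)\s\1", re.IGNORECASE)

def matched_text(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in pattern.finditer(text)]

for text in ["hot hotel", "nice ice", "the the theorem"]:
    print(repr(text),
          "without \\b:", matched_text(without_boundaries, text),
          "with \\b:", matched_text(with_boundaries, text))
```

Without the boundaries, "hot hotel" yields the false positive "hot hot" and "nice ice" yields "ice ice"; with them, only the genuine "the the" is reported.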

Answer 2: (score: 1)

Try this:

grep -E '\b(\w+)\s+\1\b' myfile.txt
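One caveat: grep works line by line, so a doubled word that is split across a line break (easy to get in hard-wrapped LaTeX source) will not be reported. A possible workaround, sketched in Python under the assumption that the whole file fits in memory, is to run the same pattern over the full text, since \s+ also matches newlines. The function name find_doubled_words and the sample text are hypothetical.

```python
import re

DOUBLED = re.compile(r"\b(\w+)\s+\1\b")

def find_doubled_words(text):
    """Return (line_number, word) for each doubled word, including pairs split across lines."""
    results = []
    for match in DOUBLED.finditer(text):
        # Line number of the first word of the pair (1-based).
        line_number = text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(1)))
    return results

# Illustrative text: "the / the" is split across lines 1 and 2.
source = (
    "This is confimided by the\n"
    "the theorem of\n"
    "The Lagrangian that that includes this potential\n"
)
print(find_doubled_words(source))
```

For a real file you would pass in the result of reading it as one string, for example open(path, encoding="utf-8").read().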

Answer 3: (score: 0)

An example in Python showing how to remove the repeated words:

In [1]: import re

In [2]: s1 = '... The Lagrangian that that includes this potential ...'

In [3]: s2 = '... This is confimided by the the theorem of ...'

In [4]: regex = r'\b(\w+)\s+\1\b'

In [5]: re.sub(regex, r'\g<1>', s1)
Out[5]: '... The Lagrangian that includes this potential ...'

In [6]: re.sub(regex, r'\g<1>', s2)
Out[6]: '... This is confimided by the theorem of ...'