Bash: Reducing long runs of consecutive repeated patterns in large files, including multi-line ones

Asked: 2017-12-12 16:00:43

Tags: regex bash awk sed

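I have large text files (Reddit dumps), and a module of my text-mining program crashes when it hits long repeated patterns (see below). I realize the problem is complex and may be better solved with several commands. I want to reduce these repeats so that only one instance remains: "AA AA AA" -> "AA".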

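Here are strings that cause the problem (please excuse the politics and obscenity; these are examples from real data, and I have already removed the worst cases):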

$ grep -oP "\b(.{25,}?)\1+\b" RS_2017-05.all_ascii_cleaned.txt


HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST 
Bridge Officer Training |       Bridge Officer Training |
        |       Bridge Officer Training |       Bridge Officer Training
BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS 
sumyeonjesumyeonjesumyeonjesumyeonjesumyeonjesumyeonjesumyeonjesumyeonje
TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS TYT SUCKS 
Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=Y=
HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST HILLARY LOST 
, Martial Skill of Choice, Martial Skill of Choice
BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS BUZZFEED SUCKS 
 him to your house, you brought him to your house, you brought
IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS IMMATURE CUCKS 
 him to your house, you brought him to your house, you brought
I clear ball, teammate takes ball and loses possession immediately, opponent shoots. I clear ball, teammate takes ball and loses possession immediately, opponent shoots. 
 http://steamcommunity.com/sharedfiles/filedetails/?id http://steamcommunity.com/sharedfiles/filedetails/?id http://steamcommunity.com/sharedfiles/filedetails/?id http://steamcommunity.com/sharedfiles/filedetails/?id        
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

A pattern can contain any characters. Worst of all are multi-line patterns such as:

r
r
r
r

01
00
01
00
01

What I have tried:

The regex I ended up with for repeats within a single line works in grep, but it does not help much in sed and hangs for some reason: sed -E "s/(.{4,}?)\1+/\1/g" test.txt

Long single-character patterns such as "GGGGG..." can be handled with: sed 's/\(.\)\1\+/\1/g' test.txt, but I could not set a minimum repeat count there.

I found code that reduces a repeated single line: sed '$!N; /^\(.*\)\n\1$/!P; D' test.txt, but I could not set a minimum there either.

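The multi-line case is also a problem. Of course, keeping a long multi-line window in memory may be very slow, but could I at least have a parameter that lets me reduce repeated patterns of, say, up to three or four lines?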

EDIT: An example of what I want to achieve:

Sample input:

RegExr v3 was created by gskinner.com, and is proudly hosted by Media Temple.

HAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHAHA
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Edit the Expression & TextTextTextTextText to TATATATATA see $$$$$$$$$ matches. X X X X X X X X Roll over matches or the expression lolk lolk lolk lolk
lolk
r
r
r
r
r
RADA
RADA
RADA
RADA

JOHN01
BAD
JOHN01
BAD
JOHN01
BAD
JOHN01
BAD

r

Here is some more good text.

ONE TWO ONE TWO ONE TWO ONE TWO ONE TWO ONE TWO ONE TWO ONE TWO ONE TWO

sumyeonjesumyeonjesumyeonjesumyeonjesumyeonjesumyeonjesumyeonjesumyeonje


This is also a good text, but repeated. This is also a good text, but repeated. This is also a good text, but repeated. 

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Here is how I would like the corpus to be cleaned (the ideal case):

RegExr v3 was created by gskinner.com, and is proudly hosted by Media Temple.

HA
G

Edit the Expression & Text to TA see $ matches. X Roll over matches or the expression lolk 
lolk
r

RADA


JOHN01
BAD


r

Here is some more good text.

ONE TWO 
sumyeonje


This is also a good text, but repeated.

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

EDIT2: Since I only want to eliminate the cases that crash my program, I think checking for the smallest repeating element is enough:

HAHAHA_2_HAHAHA_2_HAHAHA_2 -> HA_2_HA_2_HA_2

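If problems come up later, I will keep this in mind as a possible cause and clean again.

If I clean repeated lines after the initial cleaning of words, everything should be OK: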

HAHAHAHAHA            HA         HA
RADARADA              RADA       RADA
HAHAHAHAHA     --->   HA    --->
RADARADA              RADA

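2 Answers:

Answer 0: (score: 1)

You will probably need several strategies. For repeated lines, uniq will work. For repeats with a two-line period, you can join the lines in pairs and run uniq on the result. For example: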

$ cat repeat_line2

01
00
01
00
01
00
01
00

$ awk '{ORS=NR%2?FS:RS}1' repeat_line2 | uniq
01 00
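
The same idea might be generalized to blocks of N lines by turning the group size into a variable. The following is only a sketch: the helper name `join_uniq` and the awk variable `n` are my own, not part of the answer. Groups are aligned to the start of the input, so a repeat that begins off that grid is missed, and the surviving lines stay joined by spaces.

```bash
# join_uniq N: join every N input lines into one line, then drop adjacent duplicates.
# (Hypothetical helper; N defaults to 2, matching the command above.)
join_uniq() {
  awk -v n="${1:-2}" '{ ORS = (NR % n) ? FS : RS } 1' | uniq
}

printf '%s\n' 01 00 01 00 01 00 | join_uniq 2   # prints: 01 00
printf '%s\n' a b c a b c       | join_uniq 3   # prints: a b c
```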

For repeated words on the same line, you can apply the reverse operation: split the lines before running uniq.

$ cat repeat_words
AA AA AA AA
CC BB CC BB


$ sed 1G repeat_words    |  # append a blank line after line 1
  tr ' ' '\n'            |  # break words into new lines
  uniq                   |  # remove repeated words
  awk '{ORS=NR%2?FS:RS}1'|  # join two lines
  uniq                      # remove repeated two-words

will give

AA
CC BB

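You could implement all of this in a single awk script, but I think handling it with dedicated commands makes it easier to debug and improve.

Answer 1: (score: 1)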
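
For the multi-line case the question asks about (a window of up to three or four lines), a single-awk variant could look roughly like the sketch below. It is not the answerer's code: the function name `collapse_blocks` and the variable `max` are assumptions of mine. It keeps a sliding window of at most 2*max lines, so memory use does not grow with file size, and it treats blank lines like any other line, which means repeated blank lines are collapsed too.

```bash
# collapse_blocks MAX: drop back-to-back repeats of blocks of 1..MAX lines.
# (Hypothetical helper; MAX defaults to 4.)
collapse_blocks() {
  awk -v max="${1:-4}" '
    {
      buf[++n] = $0                       # append the new line to the window
      # Shortest period first: do the last p lines repeat the p lines before them?
      for (p = 1; p <= max && 2 * p <= n - start; p++) {
        same = 1
        for (i = 0; i < p; i++)
          # Appending "" forces a string comparison, so 01 and 1 stay distinct.
          if ((buf[n - i] "") != (buf[n - p - i] "")) { same = 0; break }
        if (same) { n -= p; break }       # forget the repeated copy
      }
      # Lines more than 2*max back can no longer be compared: print and forget them.
      while (n - start > 2 * max) { start++; print buf[start]; delete buf[start] }
    }
    END { for (i = start + 1; i <= n; i++) print buf[i] }
  '
}

printf '%s\n' JOHN01 BAD JOHN01 BAD JOHN01 BAD JOHN01 BAD r r r | collapse_blocks 4
```

On that sample the output should be the three lines JOHN01, BAD and r. A trailing partial repeat (for example 01, 00, 01, 00, 01) keeps its last unmatched line.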

This might be a start (GNU sed):

sed -r ':a;s/((\b|[[:punct:]]).+)\s*\1/\1/;ta' file | uniq

This removes duplicate lines and reduces repeated words to a minimum.