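I have a large text file (14 MB). I need to remove blocks of text from the file that contain 5 duplicate lines.
It would be great if this could be done with any free method.
I use Windows, but a Cygwin solution is fine too.
I have a file test1.md. It consists of repeating blocks. Each block has 10 lines. The file structure (as PCRE regular expressions):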
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
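Apart from these 10-line blocks, test1.md contains no other lines or text. It has no blank lines and no blocks with more or fewer than 10 lines. Example content: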
Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
Millionaire
AuthorOfQuestion
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
As the example shows, test1.md contains repeated 7-line blocks. For example, these blocks are:
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
and
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
I need to remove all duplicate blocks. For my example, I need to get:
Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
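Sasha, Kazan, Chistopol and Katya are also repeated, but these words should not be removed.
sort, sed and awk can solve similar tasks, but I could not work out how to use these commands for my task.
Answer 0 (score: 2)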
Here is a simple solution to your problem (if you have access to GNU sed, sort and uniq):
sed 's/^Millionaire/\x0&/' file | sort -z -k4 | uniq -z -f3 | tr -d '\000'
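To explain, in order:
- since every block starts with a Millionaire line, we can use that to split the file into (variable-length) blocks by prepending a NUL character to each Millionaire;
- next we sort those NUL-separated blocks (using the -z flag), ignoring the first 3 fields (in this case lines: Millionaire, \d+, QUESTION|ID...) by using the -k / --key option with field 4 (in your case line 4) as the start position and the end of the block as the stop position;
- after sorting, uniq filters out the duplicates, again using the NUL delimiter instead of newline (-z) and ignoring the first 3 fields (with -f / --skip-fields);
- finally, tr removes the NUL delimiters.
In general, a solution like this for removing duplicate blocks should work as long as there is a way to split the file into blocks. Note that block equality can be defined on a subset of fields (as done above).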
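The idea that block equality can be defined on a subset of lines can also be sketched outside the shell. The following Python is a minimal illustration, not a port of the pipeline above: it uses a set in a single pass, keeps the blocks in their original order, and keeps each block's 3 header lines while dropping a body that has already been seen, which is what the question's expected output shows. The function names, the `marker` and `header_size` parameters and the sample data are assumptions made for this example.

```python
def split_blocks(lines, marker="Millionaire"):
    """Start a new block at every line that begins with `marker`."""
    blocks = []
    for line in lines:
        if line.startswith(marker) or not blocks:
            blocks.append([])
        blocks[-1].append(line)
    return blocks


def drop_repeated_bodies(lines, header_size=3):
    """Keep every header; keep a body only the first time it is seen."""
    seen = set()
    out = []
    for block in split_blocks(lines):
        body = tuple(block[header_size:])  # equality ignores the header lines
        out.extend(block[:header_size])
        if body not in seen:
            seen.add(body)
            out.extend(body)
    return out


if __name__ == "__main__":
    # Hypothetical miniature input: 3 header lines plus a 2-line body per block.
    sample = [
        "Millionaire", "1", "QUESTION|a", "Q1", "x",
        "Millionaire", "2", "QUESTION|b", "Q2", "y",
        "Millionaire", "3", "QUESTION|c", "Q1", "x",
    ]
    print("\n".join(drop_repeated_bodies(sample)))
```

For a real file the lines could be read with `open("test1.md", encoding="utf-8").read().splitlines()`; the encoding is an assumption. Like the sed command, this sketch treats any line starting with the marker as the beginning of a block, so it would misfire if such a line could also appear inside a block body.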
Answer 1 (score: 1)
You can use Sublime Text's find and replace feature with the following regular expression:
\A(?1)*?((^.*$\n){5})(?1)*?\K\1+
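(i.e. with an empty replacement)
This finds a block of 5 lines that occurs again later in the document and removes the duplicate/second occurrence of those 5 lines (plus any further copies immediately adjacent to it), leaving everything else untouched (i.e. the original 5 lines that were duplicated and all other lines).
Unfortunately, due to the nature of the regex, you will need to repeat this several times to remove all duplicates. It may be easier to keep invoking "Replace" rather than using "Replace All" and having to re-open the panel each time. (Somehow \K works as expected here, despite a report of it not working with "Replace".)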
Answer 2 (score: 1)
Here is an awk + sed approach that does what you ask.
$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}'
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
Another question.
Sasha
Kazan
Chistopol
Katya
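The same first-occurrence filtering can be written as a small streaming generator. This sketch assumes, as the command above does, that the input is a flat sequence of fixed-size 5-line chunks. `unique_chunks` and its `size` parameter are hypothetical names, and the code is a hash-set approach in Python rather than a translation of what sed and awk do internally.

```python
from itertools import islice


def unique_chunks(lines, size=5):
    """Yield each fixed-size chunk of lines the first time it appears."""
    it = iter(lines)
    seen = set()
    while True:
        chunk = tuple(islice(it, size))  # the last chunk may be shorter
        if not chunk:
            break
        if chunk not in seen:
            seen.add(chunk)
            yield chunk


if __name__ == "__main__":
    # Hypothetical input, using 2-line chunks to keep the demo short.
    demo = ["a", "b", "c", "d", "a", "b", "e"]
    for chunk in unique_chunks(demo, size=2):
        print("\n".join(chunk))
```

Because it consumes an iterator, the generator can be fed an open file object directly (stripping the trailing newlines first), so the whole 14 MB file never has to be held as a list. Only the set of chunks already seen stays in memory.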
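Answer 3 (score: 1)
Please find below code for Windows PowerShell. The code is not optimized in any way. Change test.txt in the code below to your file name and make sure the working directory is the one that contains the file. The output is a csv file that you can open in Excel, sort into order, and then delete the first column from to get rid of the index. I don't know why these indexes appear or how to get rid of them. This is my first time using Windows PowerShell, and I could not find the syntax for declaring a fixed-size string array.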
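# Read the whole input file into an array of lines
$d=Get-Content test.txt
$chk=@{};
$tot=$d.Count
# $unique is a hashtable keyed by line position, so Out-File writes it as
# Name/Value pairs; that is where the index column in the output comes from
$unique=@{}
$g=0;
$isunique=1;
# Walk the input in 5-line blocks
for($i=0;$i -lt $tot){
    $isunique=1;
    $chk[0]=$d[$i]
    $chk[1]=$d[$i+1]
    $chk[2]=$d[$i+2]
    $chk[3]=$d[$i+3]
    $chk[4]=$d[$i+4]
    $i=$i+5
    # Compare the current block, line by line, with every block kept so far
    for($j=0;$j -lt $unique.count){
        if($unique[$j] -eq $chk[0]){
            if($unique[$j+1] -eq $chk[1]){
                if($unique[$j+2] -eq $chk[2]){
                    if($unique[$j+3] -eq $chk[3]){
                        if($unique[$j+4] -eq $chk[4]){
                            $isunique=0
                            break
                        }
                    }
                }
            }
        }
        $j=$j+5
    }
    # Append the block to $unique only if it was not seen before
    if ($isunique){
        $unique[$g]=$chk[0]
        $unique[$g+1]=$chk[1]
        $unique[$g+2]=$chk[2]
        $unique[$g+3]=$chk[3]
        $unique[$g+4]=$chk[4]
        $g=$g+5;
    }
}
$unique | out-file test2.csv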
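![Screenshot] http://imgur.com/a/ZP9T5
Could someone with PowerShell experience please optimize the code? I tried .Contains, .Add and so on, but did not get the desired result. Hope it helps.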
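Answer 4 (score: 1)
Your requirements are unclear about how to handle overlapping blocks of 5 lines, what to do with blocks of fewer than 5 lines at the end of the input, and various other edge cases, so here is one way to identify blocks of 5 (or fewer) lines that are duplicated: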
$ cat tst.awk
{
    for (i=1; i<=5; i++) {
        blockNr = NR - i + 1
        if ( blockNr > 0 ) {
            blocks[blockNr] = (blockNr in blocks ? blocks[blockNr] RS : "") $0
        }
    }
}
END {
    for (blockNr=1; blockNr in blocks; blockNr++) {
        block = blocks[blockNr]
        print "----------- Block", blockNr, (seen[block]++ ? "***** DUP *****" : "ORIG")
        print block
    }
}
$ awk -f tst.awk file
----------- Block 1 ORIG
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 2 ORIG
Sasha
Kristina
Katya
Valeria
Where Sasha live?
----------- Block 3 ORIG
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
----------- Block 4 ORIG
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
----------- Block 5 ORIG
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 6 ORIG
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 7 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
----------- Block 8 ORIG
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
----------- Block 9 ORIG
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
----------- Block 10 ORIG
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
----------- Block 11 ***** DUP *****
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 12 ORIG
Sasha
Kristina
Katya
Valeria
Another question.
----------- Block 13 ORIG
Kristina
Katya
Valeria
Another question.
Sasha
----------- Block 14 ORIG
Katya
Valeria
Another question.
Sasha
Kazan
----------- Block 15 ORIG
Valeria
Another question.
Sasha
Kazan
Chistopol
----------- Block 16 ORIG
Another question.
Sasha
Kazan
Chistopol
Katya
----------- Block 17 ORIG
Sasha
Kazan
Chistopol
Katya
Where Sasha live?
----------- Block 18 ORIG
Kazan
Chistopol
Katya
Where Sasha live?
St. Petersburg
----------- Block 19 ORIG
Chistopol
Katya
Where Sasha live?
St. Petersburg
Kazan
----------- Block 20 ORIG
Katya
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 21 ***** DUP *****
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 22 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 23 ORIG
Kazan
Novgorod
Chistopol
----------- Block 24 ORIG
Novgorod
Chistopol
----------- Block 25 ORIG
Chistopol
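For readers more comfortable outside awk, the sliding-window labelling above can be sketched in Python. This is an illustration under the same policy as the script (a window starts at every line, windows near the end are shorter, and any window equal to an earlier one counts as a duplicate); `duplicate_windows` and `size` are hypothetical names, and the sample is the 25-line input that the output above implies.

```python
def duplicate_windows(lines, size=5):
    """Return 1-based start positions of windows that repeat an earlier window."""
    seen = set()
    dups = []
    for start in range(len(lines)):
        # Windows near the end of the input are shorter than `size`.
        window = tuple(lines[start:start + size])
        if window in seen:
            dups.append(start + 1)
        else:
            seen.add(window)
    return dups


if __name__ == "__main__":
    q1 = ["Who is the greatest Goddess of the world?", "Sasha", "Kristina", "Katya", "Valeria"]
    q2 = ["Where Sasha live?", "St. Petersburg", "Kazan", "Novgorod", "Chistopol"]
    q3 = ["Another question.", "Sasha", "Kazan", "Chistopol", "Katya"]
    print(duplicate_windows(q1 + q2 + q1 + q3 + q2))  # [11, 21]
```

The function only reports where duplicates start. Which lines to delete when duplicate windows overlap is left open here, just as it is in the answer.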
You can build on that (for example with split(block,lines,RS)) and