Question

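1. In brief

I have a large text file (14 MB). I need to remove blocks of text from it that consist of 5 duplicated lines.

A solution using any free tool would be nice.

I use Windows, but a Cygwin solution is also fine.

2. Setup

1. File structure

I have a file, test1.md. It consists of repeating blocks, and each block has 10 lines. The structure of the file (using PCRE regular expressions):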

Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*
Millionaire
\d{18}
QUESTION.*
.*
.*
.*
.*
.*
.*
.*

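test1.md contains no lines or text other than these 10-line blocks. It has no blank lines and no blocks with more or fewer than 10 lines.

2. Example file content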

Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author

As the example shows, some 7-line blocks in test1.md are repeated. For example, these blocks:

Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion

and

Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author

3. Expected behavior

I need to remove all duplicate blocks. For my example, I need to get:

Millionaire
123456788763237476
QUESTION|2402394827049882049
Who is the greatest Goddess of the world?
Sasha
Kristina
Sasha
Katya
Valeria
AuthorOfQuestion
Millionaire
459385734954395394
QUESTION|9845495845948594999
Where Sasha live?
Novgorod
St. Petersburg
Kazan
Novgorod
Chistopol
Another author
Millionaire
778845225202502505
QUESTION|984ACFBBADD8594999A
Millionaire
903034225025025568
QUESTION|ABC121980850540445C
Another question.
Katya
Sasha
Kazan
Chistopol
Katya
Unknown author
Millionaire
450602938477581129
QUESTION|453636EE4534345AC5E

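If 7 lines repeat 7 lines that have already appeared in my file, the repeated 7 lines are removed.
If 1 line (or 2-4 lines) repeats a line that has already appeared in my file, the repeated line is not removed. In the example, Sasha, Kazan, Chistopol and Katya are repeated, but these words are not removed.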
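
The rule above can be sketched in Python as follows. This is only a minimal illustration, assuming every block is exactly 10 lines (a 3-line header followed by a 7-line body) and that the whole file fits in memory; the function name `strip_repeated_bodies` and its parameters are made up for this example.

```python
def strip_repeated_bodies(lines, block_len=10, header_len=3):
    """Keep every header; keep a block's body only the first time that body appears."""
    seen = set()   # bodies that have already been written out
    out = []
    for start in range(0, len(lines), block_len):
        block = lines[start:start + block_len]
        header, body = block[:header_len], tuple(block[header_len:])
        out.extend(header)
        if body not in seen:
            seen.add(body)
            out.extend(body)
    return out


sample = [
    "Millionaire", "123456788763237476", "QUESTION|2402394827049882049",
    "Who is the greatest Goddess of the world?",
    "Sasha", "Kristina", "Sasha", "Katya", "Valeria", "AuthorOfQuestion",
    "Millionaire", "778845225202502505", "QUESTION|984ACFBBADD8594999A",
    "Who is the greatest Goddess of the world?",
    "Sasha", "Kristina", "Sasha", "Katya", "Valeria", "AuthorOfQuestion",
]
result = strip_repeated_bodies(sample)
# The second block shrinks to its 3 header lines; the single repeated
# line "Sasha" inside the first block is left alone.
print("\n".join(result))
```

For the real file, the list of lines could be obtained with something like `open("test1.md", encoding="utf-8").read().splitlines()`; the encoding is an assumption.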

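4. What did not help

Googling: I found that the Unix commands sort, sed and awk can solve similar tasks, but I could not work out how to use them to solve mine.

5. Please do not suggest

Please do not suggest deleting each text block manually. I may have several thousand different duplicated blocks, and removing all of them by hand could take a very long time.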

Answer 1

Here is a simple solution to your problem (if you have access to GNU sed, sort and uniq):

sed 's/^Millionaire/\x0&/' file | sort -z -k4 | uniq -z -f3 | tr -d '\000'

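Explained step by step:

Since every block starts with the word/line Millionaire, we can use it to split the file into (variable-length) blocks by prepending a NUL character to each Millionaire;
then we sort those NUL-delimited blocks (using the -z flag) while ignoring the first 3 fields (in this case the lines Millionaire, \d+ and QUESTION|ID...), using the -k / --key option with the start position at field 4 (line 4 in your case) and the stop position at the end of the block;
after sorting, we can filter out the duplicates with uniq, again using NUL instead of newline as the delimiter (-z) and ignoring the first 3 fields (with -f / --skip-fields);
finally, we remove the NUL delimiters with tr.

In general, a solution like this for removing duplicate blocks should work whenever there is a way to split the file into blocks. Note that block equality can be defined on a subset of the fields (as we did above).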
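
The same idea (split into blocks, then compare blocks on a subset of their lines) can be sketched in Python for readers without the GNU tools. This is a rough equivalent rather than a translation of the pipeline: it assumes a block starts at every line beginning with Millionaire, it keeps the blocks in their original order instead of sorting them, and the helper names `split_blocks` and `unique_blocks` are made up for this example.

```python
def split_blocks(lines, marker="Millionaire"):
    """Start a new block at every line that begins with `marker`."""
    blocks = []
    for line in lines:
        if line.startswith(marker) or not blocks:
            blocks.append([])
        blocks[-1].append(line)
    return blocks


def unique_blocks(blocks, skip=3):
    """Drop every block whose lines after the first `skip` were already seen."""
    seen = set()
    kept = []
    for block in blocks:
        key = tuple(block[skip:])   # equality is defined on this subset only
        if key not in seen:
            seen.add(key)
            kept.append(block)
    return kept


lines = [
    "Millionaire", "1", "QUESTION|a", "Who?", "Sasha",
    "Millionaire", "2", "QUESTION|b", "Where?",
    "Millionaire", "3", "QUESTION|c", "Who?", "Sasha",
]
kept = unique_blocks(split_blocks(lines))
flat = [line for block in kept for line in block]
print("\n".join(flat))
```

Like the uniq step, this drops the whole duplicate block, including its first 3 lines.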

Answer 2

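You can use Sublime Text's find-and-replace feature with the following regex:

Find What: \A(?1)*?((^.*$\n){5})(?1)*?\K\1+
Replace With:

(i.e. replace with nothing)

This finds a block of 5 lines that occurs again later in the document and removes that duplicate/second occurrence of the 5 lines (along with any immediately adjacent repeats of it), leaving everything else untouched (i.e. the original 5 lines that were duplicated, and all other lines).

Unfortunately, due to the nature of the regex, you will need to run this multiple times to remove all duplicates. It may be easier to keep invoking "Replace" rather than using "Replace All" and having to re-open the panel each time. (Somehow \K works as expected here, despite a report of it not working with "Replace".)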

Answer 3

Here is an awk + sed approach that does what you asked.

$ sed '0~5 s/$/\n/g' file | awk -v RS= '!($0 in a){a[$0];print}'
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
Another question.
Sasha
Kazan
Chistopol
Katya

Answer 4

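Please find below code for Windows PowerShell. The code is not optimized in any way. Change test.txt in the code below to your file name, and make sure the working directory is the one that contains it. The output is a CSV file that you can open in Excel, sort, and then delete the first column from to get rid of the indices. I don't know why these indices appear or how to get rid of them. This is my first time using Windows PowerShell, and I couldn't find the syntax to declare a fixed-size array of strings.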

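$d = Get-Content test.txt
$chk = @{}       # the current 5-line block, keyed 0..4
$tot = $d.Count
$unique = @{}    # lines of the unique blocks, keyed by a running index
$g = 0
$isunique = 1
for ($i = 0; $i -lt $tot) {
    $isunique = 1
    # Copy the next 5 lines into $chk.
    $chk[0] = $d[$i]
    $chk[1] = $d[$i+1]
    $chk[2] = $d[$i+2]
    $chk[3] = $d[$i+3]
    $chk[4] = $d[$i+4]
    $i = $i + 5

    # Compare the block with every block already stored in $unique.
    for ($j = 0; $j -lt $unique.Count) {
        if ($unique[$j] -eq $chk[0]) {
            if ($unique[$j+1] -eq $chk[1]) {
                if ($unique[$j+2] -eq $chk[2]) {
                    if ($unique[$j+3] -eq $chk[3]) {
                        if ($unique[$j+4] -eq $chk[4]) {
                            $isunique = 0
                            break
                        }
                    }
                }
            }
        }
        $j = $j + 5
    }

    # Store the block if no earlier block matched it.
    if ($isunique) {
        $unique[$g]   = $chk[0]
        $unique[$g+1] = $chk[1]
        $unique[$g+2] = $chk[2]
        $unique[$g+3] = $chk[3]
        $unique[$g+4] = $chk[4]
        $g = $g + 5
    }
}

# $unique is a hashtable, so this writes a Name/Value table (index and line).
$unique | Out-File test2.csv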

Screenshot: http://imgur.com/a/ZP9T5

Would someone with PowerShell experience please optimize the code? I tried .Contains, .Add and so on, but did not get the desired result. Hope it helps.

Answer 5

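Your requirements are unclear about how to handle overlapping blocks of 5 lines, what to do with blocks of fewer than 5 lines at the end of the input, and various other edge cases, so here is one way to identify blocks of 5 (or fewer) duplicated lines: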

$ cat tst.awk
{
    for (i=1; i<=5; i++) {
        blockNr = NR - i + 1
        if ( blockNr > 0 ) {
            blocks[blockNr] = (blockNr in blocks ? blocks[blockNr] RS : "") $0
        }
    }
}
END {
    for (blockNr=1; blockNr in blocks; blockNr++) {
        block = blocks[blockNr]
        print "----------- Block", blockNr, (seen[block]++ ? "***** DUP *****" : "ORIG")
        print block
    }
}

$ awk -f tst.awk file
----------- Block 1 ORIG
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 2 ORIG
Sasha
Kristina
Katya
Valeria
Where Sasha live?
----------- Block 3 ORIG
Kristina
Katya
Valeria
Where Sasha live?
St. Petersburg
----------- Block 4 ORIG
Katya
Valeria
Where Sasha live?
St. Petersburg
Kazan
----------- Block 5 ORIG
Valeria
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 6 ORIG
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 7 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
----------- Block 8 ORIG
Kazan
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
----------- Block 9 ORIG
Novgorod
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
----------- Block 10 ORIG
Chistopol
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
----------- Block 11 ***** DUP *****
Who is the greatest Goddess of the world?
Sasha
Kristina
Katya
Valeria
----------- Block 12 ORIG
Sasha
Kristina
Katya
Valeria
Another question.
----------- Block 13 ORIG
Kristina
Katya
Valeria
Another question.
Sasha
----------- Block 14 ORIG
Katya
Valeria
Another question.
Sasha
Kazan
----------- Block 15 ORIG
Valeria
Another question.
Sasha
Kazan
Chistopol
----------- Block 16 ORIG
Another question.
Sasha
Kazan
Chistopol
Katya
----------- Block 17 ORIG
Sasha
Kazan
Chistopol
Katya
Where Sasha live?
----------- Block 18 ORIG
Kazan
Chistopol
Katya
Where Sasha live?
St. Petersburg
----------- Block 19 ORIG
Chistopol
Katya
Where Sasha live?
St. Petersburg
Kazan
----------- Block 20 ORIG
Katya
Where Sasha live?
St. Petersburg
Kazan
Novgorod
----------- Block 21 ***** DUP *****
Where Sasha live?
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 22 ORIG
St. Petersburg
Kazan
Novgorod
Chistopol
----------- Block 23 ORIG
Kazan
Novgorod
Chistopol
----------- Block 24 ORIG
Novgorod
Chistopol
----------- Block 25 ORIG
Chistopol

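You can build on this to:

use blockNr plus the current line number within that block (hint: split(block,lines,RS)) to locate the lines to delete, and
figure out how to handle your unspecified requirements.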
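
As one possible way to build on it, here is a Python sketch that slides a 5-line window over the input and deletes every window that exactly repeats an earlier one. It has to pick answers to the unspecified requirements, so these are assumptions: a repeat only counts if it does not overlap the first occurrence, scanning resumes right after a deleted window, and trailing runs shorter than 5 lines are never deleted. The function name `remove_repeated_windows` is made up for this example.

```python
def remove_repeated_windows(lines, size=5):
    """Delete each `size`-line window that repeats an earlier, non-overlapping window."""
    first_seen = {}   # window contents -> index where it first started
    drop = set()      # indices of the lines to delete
    i = 0
    while i + size <= len(lines):
        window = tuple(lines[i:i + size])
        first = first_seen.setdefault(window, i)
        if first + size <= i:
            # Exact repeat of an earlier window: drop it and skip past it.
            drop.update(range(i, i + size))
            i += size
        else:
            i += 1
    return [line for k, line in enumerate(lines) if k not in drop]


doc = [
    "Q1", "a", "b", "c", "d",
    "Q2", "e", "f", "g", "h",
    "Q1", "a", "b", "c", "d",   # repeats the first 5 lines
    "Q3", "a", "f", "h", "c",   # single repeated lines only, kept
]
cleaned = remove_repeated_windows(doc)
print("\n".join(cleaned))
```

Because the windows overlap, the repeated block does not have to start on a multiple of 5 lines to be found.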

Remove n duplicate lines in a file