Question

File:
this is a paragraph
to find in another 
file

some stuff .. 

more stuff ... 

this is a paragraph
to find in another 
file

more stuff ... 

another paragraph 
to match

yet more stuff.. 

this is a paragraph
duplicate in this 
file

another paragraph 
to match 

this is a paragraph
duplicate in this 
file

yet more stuff..

this is a paragraph
to find in another
file

Should return:

this is a paragraph
to find in another 
file

some stuff .. 

more stuff ... 

more stuff ... 

another paragraph 
to match

yet more stuff.. 

this is a paragraph
duplicate in this 
file

yet more stuff..

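I found pcregrep -n -M, and I know I could loop with sed and that command to search for each paragraph, but pcregrep isn't installed on every system, so it would be nice to avoid it. I'm looking for something elegant that uses standard *nix tools, preferably not perl.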

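* Some good posts and ideas, but they did not work correctly beyond the limited case I originally posted, so I have adjusted the sample data so you can see whether a solution works more generally.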

* Here is a sed one-liner that prints only the multi-line paragraphs:

sed -e '/./{H;$!d;}' -e 'x;/.*\n.*\n.*/!d' file

Answer 1

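This mostly does what you want. The only problem (that I know of offhand) is that it collapses runs of blank lines in the input into a single blank line in the output.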

awk -v RS= '!x[$0]++{print; print ""}'

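It relies on the fact that "if RS is set to the null string, then records are separated by blank lines", and it prints an extra blank line to stand in for the record separator that awk swallowed.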
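To make the paragraph-mode behavior concrete, here is a small sketch that wraps the one-liner above in a shell function and runs it on a tiny input. The function name dedupe_paragraphs and the sample input are made up for illustration, and the array is called seen instead of x purely for readability.

```shell
# Hypothetical helper name; it just wraps the one-liner above.
dedupe_paragraphs() {
    # RS= switches awk to paragraph mode: each blank-line-separated block
    # becomes one record ($0). seen[$0]++ evaluates to 0 the first time a
    # paragraph appears, so !seen[$0]++ is true only for first occurrences.
    # The second print restores the blank line that awk consumed as separator.
    awk -v RS= '!seen[$0]++ { print; print "" }' "$@"
}

# Made-up input: the paragraph "a/b" appears twice, with a run of two
# blank lines before its second occurrence.
printf 'a\nb\n\nc\n\n\na\nb\n\nd\n' | dedupe_paragraphs
```

The second "a/b" paragraph is dropped, and every surviving paragraph is followed by exactly one blank line, which is the collapsing of blank-line runs mentioned above.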

Edit: Incorporating @EdMorton's suggestions gets you this:

awk -v RS= -v ORS='\n\n' '!seen[$0]++'

And with GNU awk, awk -v RS= '!seen[$0]++{ORS=RT; print}' keeps the spacing between paragraphs the same as in the input (rather than collapsing runs of blank lines).

Edit again:

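This version appears to work correctly (with GNU awk 3.1.7 and later; I don't know about 3.1.6), with the one exception that it adds a blank line at the end of the file.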

awk -v RS= '{gsub(/[[:blank:]]+$/,""); gsub(/[[:blank:]]+\n/,"\n")} !seen[$0]++{ORS=RT;print}'
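Since RT is a GNU awk extension, a variant that avoids it may be useful on systems that only have a POSIX awk. The following is a sketch under that assumption and is not part of the original answer: the function name dedupe_normalized is hypothetical, [ \t] is used in place of [[:blank:]] for older awks that lack character classes, and the whitespace stripping is applied to a copy of the paragraph used only as the comparison key, so the first occurrence is printed unchanged.

```shell
# Hypothetical helper name; an RT-free sketch, not the answer's own command.
dedupe_normalized() {
    awk -v RS= -v ORS='\n\n' '
    {
        key = $0
        gsub(/[ \t]+\n/, "\n", key)   # drop blanks at the end of inner lines
        sub(/[ \t]+$/, "", key)       # drop blanks at the end of the paragraph
    }
    !seen[key]++                      # print only the first paragraph per key
    ' "$@"
}

# Made-up input: the two "p/q" paragraphs differ only in trailing spaces.
printf 'p \nq\n\nx\n\np\nq \n' | dedupe_normalized
```

The trade-off is the one already noted for the ORS='\n\n' version: runs of blank lines in the input collapse to a single blank line, and the output ends with an extra blank line.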

Looking for a one-liner to remove duplicate multi-line paragraphs from a file
