Question

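I process a file with awk to filter out the lines I am interested in. From the resulting output, I want to remove every line that starts with the same string as a later line, keeping only the last line for each such string.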

Here is an example of the generated output:

this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text

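Lines 2 and 3 should be removed because they start with duplicate, and so does line 5. Line 5 should be kept because it is the last line starting with duplicate.

The same goes for line 6: it starts with example, and so does line 7. Line 7 should be kept because it is the last line starting with example.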

Given the example above, I want to produce the following output:

this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

How can I achieve this?

I tried the following, but it does not work correctly:

awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' -

Answer 1

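Why not read the file from the end to the beginning and print only the first line that contains duplicate? That way you don't have to track what has already been printed, hold a line back, and so on.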

tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac

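This sets the flag f the first time duplicate is seen. From the second time on, the flag causes the line to be skipped. The trailing 1 is an always-true pattern, so every line that is not skipped gets printed.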

To make this generic, so that for every first word only the last line starting with it is printed, use an array:

tac file | awk '!seen[$1]++' | tac

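This keeps track of the first words seen so far. They are stored in the array seen[], so the expression !seen[$1]++ is true the first time a given $1 appears; from the second time on it evaluates to false and the line is not printed.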
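To see why the two tac calls are needed, here is a minimal sketch of the idiom on its own. The three-line input is made up for illustration and is not from the question. Without the reversal, !seen[$1]++ keeps the first line for each first word, not the last:

```shell
# Made-up sample: the first word "a" appears on lines 1 and 3.
# seen["a"] is 0 on line 1, so !0 is true and the line prints;
# by line 3 it is 1, so !1 is false and the line is dropped.
kept=$(printf 'a 1\nb 2\na 3\n' | awk '!seen[$1]++')
printf '%s\n' "$kept"
```

Reversing the input first turns "first occurrence" into "last occurrence", and the second tac restores the original order.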

Test

$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

Answer 2

You can use an (associative) array to always keep the last occurrence:

awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
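One caveat: the order in which for (i in last) visits the array is unspecified in awk, so the one-liner above may not print lines in their input order. If order matters, a possible variant is sketched below. It is not part of the original answer; the array names key, line, and last and the temporary file are chosen for illustration. It assumes the filtered output fits in memory, since every line is stored until END.

```shell
# Sample input from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text
EOF

result=$(awk '
  # Remember each line, its first word, and the last line number per first word.
  { key[NR] = $1; line[NR] = $0; last[$1] = NR }
  END {
    # Walk the lines in input order; print a line only if it is
    # the last one that starts with its first word.
    for (i = 1; i <= NR; i++)
      if (last[key[i]] == i) print line[i]
  }
' "$input")

printf '%s\n' "$result"
rm -f "$input"
```

Because it reads standard input just as well as a file, the awk program can also sit at the end of the pipeline from the question in place of the second awk command.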
