Run uniq only on matching lines while ignoring certain columns

Time: 2016-06-04 03:28:17

Tags: regex perl awk uniq

Suppose I have an input file that looks like this:

2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:15 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.

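I can remove all consecutive duplicate lines, ignoring the first two columns, with uniq -f2 file.txt, but I am looking for a way to remove only the duplicates that contain has connected. so that the output looks like this: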

2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.

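I suppose this could be done by matching a fixed string ("has connected.") but I am also interested in a command that can use a regular expression.

I looked at the answers to this question, but could not modify the commands so that they work with my input.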

4 answers:

Answer 0: (score: 1)

$ awk -F'>' '!(/has connected/ && seen[$2]++)' file
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
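
Note that the command above remembers every matching message for the whole file, so a user who reconnects much later is suppressed as well. If only consecutive duplicates should be dropped, as uniq -f2 does, a minimal sketch could compare each matching line with the message of the line directly before it. The function name dedup_matching and the variable pat are hypothetical, introduced only for illustration; the regex is passed as the first argument so it can be changed.

```shell
# Hypothetical helper (not from the original answer): print every line,
# except a line that matches the regex in $1 AND whose message is the same
# as the message of the line immediately before it.
dedup_matching() {
    pat=$1
    shift
    awk -v pat="$pat" '
    {
        msg = $0
        # Strip the date and time columns, similar to what uniq -f2 ignores.
        sub(/^[^ ]+ +[^ ]+ +/, "", msg)
        # Suppress only consecutive repeats of a matching message.
        if (!(msg ~ pat && msg == prev)) print
        prev = msg
    }' "$@"
}
```

It could be called as dedup_matching 'has connected[.]' file.txt, which for the sample input should leave the seven lines asked for in the question. The dot is written as [.] rather than \. because awk processes backslash escapes in -v assignments.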

Answer 1: (score: 1)

A one-line Perl solution

perl -nE 'print unless /has connected/ && @s{/>\s+(.+)/}++' myfile.log

Output

2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.

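Note that the hash slice @s{/>\s+(.+)/}++ is used deliberately. A slice like this is usually a mistake, but here it serves to put the regex match in list context, so that the match returns the captured text instead of a true/false value.

If you want something cute like what Chris Charley wrote, where a connection is reported only if the user has previously disconnected, then it cannot be done legibly in a one-liner. This script will do it for you.

If you are not familiar with Perl: to run this on a file, change <DATA> to <> and run the program like this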

$ perl filter.pl myfile.log
use strict;
use warnings;

my %online;

while ( <DATA> ) {

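    # Capture the user name and the event. "has " is optional so that both
    # "foobar disconnected." and "user2 has disconnected." yield the bare name.
    next unless my ($name, $op) = />\s+(.+?)\s+(?:has\s+)?(disconnected|connected)\./;

    if ( $op eq 'disconnected' ) {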
        delete $online{$name};
        print;
    }
    else {
        print unless $online{$name}++;
    }
}

__DATA__
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:15 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:15 > user1 disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.

Output

2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:15 > user1 disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.

Answer 2: (score: 0)

With awk:

A line is printed if its message is not already in the array, or if the message contains "disconnected":

awk -F">" '!($2 in a) || $2 ~ /disconnected/ {a[$2]=$2; print}' < file.txt

The condition that performs this check is

!($2 in a) || $2 ~ /disconnected/

Answer 3: (score: 0)

I think this Perl solution may be what you want. I added a few more lines to the data.

#!/usr/bin/perl
use strict;
use warnings;

my %seen;
while (<DATA>) {
    if (/ > (.+? connected)/) {
        print unless $seen{$1}++;
    }
    else {
        %seen = ();
        print;  
    }   
}

__DATA__
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:15 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:31 > user1 has connected.
2016-06-03 21:00:31 > user1 has connected.
2016-06-03 21:00:34 > user1 has connected.
2016-06-03 21:00:50 > user2 has connected.
2016-06-03 21:00:51 > user2 has connected.

Prints

2016-06-03 21:00:14 > user1 has connected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:22 > foobar disconnected.
2016-06-03 21:00:29 > user2 has connected.
2016-06-03 21:00:29 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:30 > user2 has disconnected.
2016-06-03 21:00:31 > user1 has connected.
2016-06-03 21:00:50 > user2 has connected.