Sort a file by the number after a specific word in bash

Time: 2017-07-27 11:42:34

Tags: bash shell unix

I have a large file (more than 1000 lines) that I need to sort by a specific criterion. The file contains lines like these:

bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} []
bla  took 531ms. {"uniqueId":"wdagweg","duration":531} []
[2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. {qwgwqgce":"local","duration":47} []
[2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} []
[2017-07-26 1wergwg  local took 39ms. {"uniqueId"wetgwgweqg gg}

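I need to sort the lines by the number after "took". With awk I can extract and sort the numbers themselves:

awk '{for(i=1;i<=NF;i++) if ($i=="took") print $(i+1)}' test | sort -h

But the output has to contain the complete lines, only reordered, with nothing lost. Unfortunately the ms values are not always in the same column (that would be easy).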

I will accept a solution that calls another interpreter (perl, python, etc.) if it is better (faster/simpler/more correct) than a native bash solution.

3 answers:

Answer 0 (score: 3)

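A simple way to do this is to extract the value you want to sort on into its own leading column, sort on that column, and then strip it off again in a later pipeline element.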

So, as a first step:

gawk 'match($0, /took ([[:digit:]]+)/, m) {printf("%s\t%s\n", m[1], $0)}'

This makes your stream look like:

536 bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} []
531 bla  took 531ms. {"uniqueId":"wdagweg","duration":531} []
47  [2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. {qwgwqgce":"local","duration":47} []
41  [2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} []
39  [2017-07-26 1wergwg  local took 39ms. {"uniqueId"wetgwgweqg gg}

...at which point you can pipe it through sort -n to sort on the leading number, followed by a pipeline element that strips that leading value:

gawk 'match($0, /took ([[:digit:]]+)/, m) {printf("%s\t%s\n", m[1], $0)}' \
 | sort -n | cut -d $'\t' -f 2-

...and we have the output:

[2017-07-26 1wergwg  local took 39ms. {"uniqueId"wetgwgweqg gg}
[2017-07-2 [bla] Aocal took 41ms. {"uniagwrqgwqrwqg ation":41} []
[2017-07-26 11:34:04.346533] wgwqegwqeg took 47ms. {qwgwqgce":"local","duration":47} []
bla  took 531ms. {"uniqueId":"wdagweg","duration":531} []
bla bla bla took 536ms. {"uniqueId":"ygfwyagf","duration":536} []
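
The three-argument form of match() used above is a gawk extension. Where only a POSIX awk is available, the same decorate/sort/undecorate idea can be sketched with RSTART and RLENGTH instead. This is a minimal sketch rather than part of the original answer: the function name sort_by_took is made up for illustration, and, like the gawk version, it drops any line that has no "took <number>" in it.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): sort lines by the number
# that follows "took ". Reads the named files, or stdin when none are given.
# Steps: awk prepends "<number><TAB>", sort orders numerically on that key
# only, cut removes the key again (cut splits on tabs by default).
sort_by_took() {
    tab=$(printf '\t')
    awk '
        # match() sets RSTART/RLENGTH for the first "took <digits>" on the line
        match($0, /took [0-9]+/) {
            # skip the 5 characters of "took " so only the digits remain
            printf "%s\t%s\n", substr($0, RSTART + 5, RLENGTH - 5), $0
        }
    ' "$@" |
        sort -t "$tab" -k1,1n |
        cut -f 2-
}
```

With the file from the question this would be called as sort_by_took test. Restricting the sort key to the first tab-separated field keeps later numbers in the line from influencing the order.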

Answer 1 (score: 2)

With Perl, you could write

perl -e '
    while (<>) {
        if (/took (\d+)/) {
            push @{$lines{$1}}, $_;
        } 
    } 
    for $num (sort {$a <=> $b} keys %lines) {
        print join("", @{$lines{$num}});
    }
' file

or, as line noise:

perl -lnE'/took (\d+)/&&push@{$l{$1}},$_}END{say@{$l{$_}}for sort{$a<=>$b}keys%l' file

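Answer 2 (score: 1)

As an alternative, a more concise approach with gawk is to use the time (the number after "took") as the index of an array (tim), then use the asorti function to sort those indices into a second array (tim1). The sorted indices in tim1 are then used to pull the lines back out of tim in order.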
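
A sketch of the approach just described, not the answerer's own script. It assumes gawk 4.0 or later, since it relies on the three-argument asorti with the "@ind_num_asc" ordering so that indices compare as numbers rather than strings. The array names tim and tim1 come from the text; the function name and the way lines with equal durations are kept together are assumptions.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): sort lines by the number
# after "took " using gawk arrays and asorti. Requires gawk >= 4.0.
sort_by_took_asorti() {
    gawk '
        match($0, /took ([[:digit:]]+)/, m) {
            key = m[1]
            # Assumption: lines sharing a duration are joined with newlines
            # so that none of them is overwritten.
            tim[key] = (key in tim) ? tim[key] ORS $0 : $0
        }
        END {
            # tim1[1..n] receives the indices of tim in ascending numeric order
            n = asorti(tim, tim1, "@ind_num_asc")
            for (i = 1; i <= n; i++)
                print tim[tim1[i]]
        }
    ' "$@"
}
```

Called as sort_by_took_asorti test, this holds the whole file in memory, which should be fine for a file of about a thousand lines.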
