Get lines by a unique part of the line, showing only the first match of that unique part

Time: 2013-07-18 20:42:12

Tags: bash sorting cut

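I'm trying to write a script that looks at one part of each line, runs sort -u or something similar to find the unique occurrences, and then displays the output in the ORIGINAL line order. In other words, only the first occurrence of that part of the line should be shown.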

I managed to do this with cut, but my output shows only the cut portion of the data. How can I do this so that I get the whole line?

Here is what I have so far:

cut -d, -f6 infile.txt | cut -c4-11 | grep -n . | sort -t: -k2,2 -u | sort -t: -k1n,1 | cut -d: -f2-

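I know the data contains no extra : or , in places that would break this script. But this outputs only the unique data. How can I get the whole line? I'd rather stay away from perl, but awk is fine (although I don't know it well).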

Example:

If the input file looks like this (note that the ABCDEFGH isn't really there; I only added it to illustrate what I mean):

A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
C....,....,...........,.....,....,...20130718......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
F....,....,...........,.....,....,...20130714......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
H....,....,...........,.....,....,...20130718......,.........,...........,......

My script's output:

20130718
20130714
20130719
20130713
20130630

What I want to see:

A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......

1 answer:

Answer 0 (score: 5)

Yes, awk is your best bet. Here is a cryptic example:

awk -F, '!seen[substr($6,4,8)]++' infile.txt

Explanation:

options:
  -F,              set the field separator to ,

condition:
  substr($6,4,8)   up to 8 characters starting at the fourth character
                   of the sixth field
  seen[...]++      seen is an associative array (dictionary). Increment the
                   value associated with ..., and return the old value
  !seen[...]++     negate the old value: true only the first time a key
                   appears, so the action is performed once per key


action:
  There is no action, only a condition, so the default action is
  performed if the test succeeds. The default action is to print
  the line. So the line will be printed if the relevant characters of
  the sixth field haven't yet been seen.
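For readers who find the one-liner too terse, the same logic can be spelled out with an explicit key variable and an explicit print. This is only a sketch of an equivalent form; the function name first_by_key and the variable name key are made up for illustration and do not appear in the original answer.

```shell
# Hypothetical helper (the name is an assumption): print only the first
# line seen for each distinct key, preserving the original line order.
# The key is the same as in the one-liner: up to 8 characters starting
# at the 4th character of the 6th comma-separated field.
first_by_key() {
    awk -F, '
    {
        key = substr($6, 4, 8)     # extract the part to deduplicate on
        if (!(key in seen)) {      # true only the first time this key appears
            seen[key] = 1          # remember the key
            print                  # print the whole line ($0)
        }
    }' "$@"
}
```

It could then be called as first_by_key infile.txt, or fed from a pipe. Because awk reads the input once and prints matching lines as it goes, no sorting is needed and the original order is kept.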

Test:

$ awk -F, '!seen[substr($6,4,8)]++' <<EOF
> A....,....,...........,.....,....,...20130718......,.........,...........,......
> B....,....,...........,.....,....,...20130714......,.........,...........,......
> C....,....,...........,.....,....,...20130718......,.........,...........,......
> D....,....,...........,.....,....,...20130719......,.........,...........,......
> E....,....,...........,.....,....,...20130713......,.........,...........,......
> F....,....,...........,.....,....,...20130714......,.........,...........,......
> G....,....,...........,.....,....,...20130630......,.........,...........,......
> H....,....,...........,.....,....,...20130718......,.........,...........,......
> EOF
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
$