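I'm trying to filter out rows that have non-unique strings in a specific column, keeping only the row with the lowest values in 2 other columns, and of course also keeping the rows that have no duplicates. See my example table: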
Col1 Col2 Col3 Col4 Col5 Col6 Col7
blah blah 1 blah blah 1 BBBB
blah blah 0 blah blah 3 AAAA
blah blah 1 blah blah 3 BBBB
blah blah 2 blah blah 0 AAAA
blah blah 0 blah blah 0 AAAA
blah blah 8 blah blah 3 CCCC
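Col1, Col2, Col4 and Col5 are unimportant and only need to be copied. If the string in Col7 occurs more than once, then out of all its occurrences I want to print only the row with the lowest value in Col3 and, if there is a tie, the lowest value in Col6. Finally, I'd like to add a new column that says "unique" or "multi" to indicate whether duplicates existed.
My desired output looks like this: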
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
blah blah 0 blah blah 0 AAAA multi
blah blah 1 blah blah 1 BBBB multi
blah blah 8 blah blah 3 CCCC unique
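So far I've been struggling with awk. I can find all the rows with duplicate strings and print them on one line, but I don't know how to filter them before printing.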
awk '{dup[$7]=dup[$7] ? dup[$7] " duplicate of " $1 : $1} END {for (x in dup) print dup[x], x}'
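Any help would be much appreciated. A solution in awk (with an explanation, please) would be preferred, as I'm trying to understand it better.
Edited for clarity.
Answer 0 (score: 0)
Let's assume we have the following Input_file, where I have built the example so that the 3rd column has some ties and the 6th column has values lower than the 3rd column.
Input_file: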
cat Input_file
Col1 Col2 Col3 Col4 Col5 Col6 Col7
blah blah 0 blah blah -3 AAAA
blah blah 1 blah blah 3 BBBB
blah blah 2 blah blah -9 AAAA
blah blah 0 blah blah 0 AAAA
blah blah 1 blah blah 1 BBBB
blah blah 8 blah blah 3 CCCC
Now, the following is the code for it.
Code:
awk '
FNR>1 && FNR==NR{
a[$NF]=a[$NF]>$3?$3:a[$NF]?a[$NF]:$3;
c[$NF]++;
b[$NF,$3]++;
e[$3]=$3
d[$NF]=d[$NF]>$6?$6:d[$NF]?d[$NF]:$6;
next
}
FNR==1 && FNR!=NR{
print $0,"Col8";
for(i in b){
val=i
gsub(/[0-9]+/,"",val)
if(b[i]>1 && a[val]==e[val]){
a[val]=a[val]<d[val]?a[val]:d[val]
}}
next
}
($NF in a) && ($6==a[$NF]||$3==a[$NF]){
printf("%s %s\n",$0,c[$NF]>1?"Multi":"Unique");
delete a[$NF]
}
' SUBSEP="" Input_file Input_file
Executing the code:
./script.ksh
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
blah blah 1 blah blah 3 BBBB Multi
blah blah 2 blah blah -9 AAAA Multi
blah blah 8 blah blah 3 CCCC Unique
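The code above relies on the awk two-pass idiom: the same file is named twice on the command line, and NR==FNR is true only while the first copy is being read. A minimal sketch of just that idiom, assuming a throwaway sample file created with mktemp (the file contents and the names tmp and labeled are illustrative, not part of the answer). It only labels each row and does no filtering:

```shell
# Hypothetical sample data written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Col1 Col2 Col3 Col4 Col5 Col6 Col7
blah blah 1 blah blah 1 BBBB
blah blah 0 blah blah 3 AAAA
blah blah 1 blah blah 3 BBBB
blah blah 8 blah blah 3 CCCC
EOF

labeled=$(awk '
  # Pass 1 (NR==FNR): count how often each Col7 value occurs, skipping the header.
  NR == FNR { if (FNR > 1) count[$7]++; next }
  # Pass 2: extend the header, then tag every data row.
  FNR == 1  { print $0, "Col8"; next }
  { print $0, (count[$7] > 1 ? "multi" : "unique") }
' "$tmp" "$tmp")

printf '%s\n' "$labeled"
rm -f "$tmp"
```

The first pass can collect anything that needs the whole file (counts, minima), and the second pass then decides row by row what to print.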
Answer 1 (score: 0)
sort + uniq + sed trick:
echo "$(head -1 file) Col8" && \
sort -k7,7 -k3,3n -k6,6n <(tail -n +2 file) | uniq -cf6 \
| sed -E 's/^ *1 (.*)/\1 unique/; s/^ *([2-9]|[0-9]{2,}) (.*)/\2 multi/'
Output:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
blah blah 0 blah blah 0 AAAA multi
blah blah 1 blah blah 1 BBBB multi
blah blah 8 blah blah 3 CCCC unique
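When Col3 or Col6 can hold values with more than one digit, the sort keys for those columns need the n modifier, because sort otherwise compares keys as text. A small sketch of the difference, assuming three made-up values (the names values, lexical and numeric are illustrative only):

```shell
# Three hypothetical Col3 values, one per line.
values=$(printf '%s\n' 10 2 9)

# Plain key: compared as text, so "10" sorts before "2".
lexical=$(printf '%s\n' "$values" | LC_ALL=C sort -k1,1 | tr '\n' ' ')
# Key with the n modifier: compared as numbers.
numeric=$(printf '%s\n' "$values" | LC_ALL=C sort -k1,1n | tr '\n' ' ')

printf 'lexical: %s\nnumeric: %s\n' "$lexical" "$numeric"
```

Since uniq keeps the first line of each group, the row that survives is whichever one sort placed first, so the comparison mode decides which row is reported as the lowest.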
----------
Bonus GNU datamash + awk solution:
datamash -WHfs -g7 count 7 min 6 min 3 <file \
| awk 'NR==1{ $8="Col8" }NR>1{ $3=$10; $6=$9; $8=($8>1)?"multi":"unique" }{$9=$10=""}1' \
| column -t
Answer 2 (score: 0)
awk to the rescue, with the help of other tools:
$ head -1 file && sed 1d file |
sort -k7 -k3,3n -k6,6n |
uniq -c -f6 |
awk '!a[$NF]++{c=$1; sub(/^ *[0-9]+ /,""); print $0, (c==1 ? "uniq" : "multi")}'
Col1 Col2 Col3 Col4 Col5 Col6 Col7
blah blah 0 blah blah 0 AAAA multi
blah blah 1 blah blah 1 BBBB multi
blah blah 8 blah blah 3 CCCC uniq
Of course, if your file has no header, you can get rid of the first part.
Answer 3 (score: 0)
Using awk
One-liner
awk 'FNR==1{print $0,"Col8";next}function cp(){a[$7]=$0;b[$7]=$3;c[$7]=$6}{f=$7 in a; d[$7]++}!f{o[++i]=$7; cp(); next}f && b[$7]>$3{cp();next}f && b[$7]==$3 && $6<c[$7]{cp();next}END{for(i=1; i in o; i++)print a[o[i]],(d[o[i]]>1?"multi":"unique")}' file
Explanation
awk '
# if first line was read then print current record and extra field as header
# go to next line
FNR==1{
print $0,"Col8";
next
}
# function which will be used frequently to store data
function cp(){
a[$7]=$0;
b[$7]=$3;
c[$7]=$6;
}
# variable f holds boolean status whether array a has key of field7 value
# variable f will be used frequently
{
f=$7 in a;
d[$7]++
}
# if key does not exist in array a then
!f{
# store order
# copy data
# go to next line
o[++i]=$7;
cp();
next
}
# if f is true and 3rd field value of previously stored data
# is greater than current record field3 data then
# we got smaller value lets save it
# and go to next line
f && b[$7]>$3{
cp();
next
}
# if variable f is true and field3 value of previously stored data
# is equal to current record field3 its tie,
# and check whether 6th field value is lesser than previously stored value
# then we got smaller value from field6 of current row/record/line
# copy data
# go to next line
f && b[$7]==$3 && $6<c[$7]{
cp();
next
}
# end block
# loop through array o,
# print value from array a, where index being o[i]
# d[o[i]] holds count of occurrence of field7
# if its greater than 1 then its multi otherwise unique
END{
for(i=1; i in o; i++)
print a[o[i]],(d[o[i]]>1?"multi":"unique")
}
' file
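The explanation above can be tried without creating a file. A sketch, assuming the sample rows from the question are fed through a here-document. The function name pick_lowest and the variable result are hypothetical, and the three comparison rules are folded into a single condition rather than kept as separate pattern blocks:

```shell
# pick_lowest: hypothetical helper; reads the table on stdin.
pick_lowest() {
  awk '
    NR == 1 { print $0, "Col8"; next }           # header gets the extra column
    {
      k = $7
      n[k]++                                      # occurrences of this Col7 value
      if (!(k in row)) order[++cnt] = k           # remember first-seen order
      # Keep the row when it is the first for k, has a lower Col3,
      # or ties on Col3 with a lower Col6 (adding 0 forces numeric comparison).
      if (n[k] == 1 || $3 + 0 < c3[k] || ($3 + 0 == c3[k] && $6 + 0 < c6[k])) {
        row[k] = $0; c3[k] = $3 + 0; c6[k] = $6 + 0
      }
    }
    END {
      for (i = 1; i <= cnt; i++)
        print row[order[i]], (n[order[i]] > 1 ? "multi" : "unique")
    }
  '
}

result=$(pick_lowest <<'EOF'
Col1 Col2 Col3 Col4 Col5 Col6 Col7
blah blah 1 blah blah 1 BBBB
blah blah 0 blah blah 3 AAAA
blah blah 1 blah blah 3 BBBB
blah blah 2 blah blah 0 AAAA
blah blah 0 blah blah 0 AAAA
blah blah 8 blah blah 3 CCCC
EOF
)
printf '%s\n' "$result"
```

Rows come out in the order in which each Col7 value was first seen, so with this input BBBB is printed before AAAA; pipe the result through sort if a different order is wanted.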
Input
$ cat file
Col1 Col2 Col3 Col4 Col5 Col6 Col7
blah blah 0 blah blah 3 AAAA
blah blah 1 blah blah 3 BBBB
blah blah 2 blah blah 0 AAAA
blah blah 0 blah blah 0 AAAA
blah blah 1 blah blah 1 BBBB
blah blah 8 blah blah 3 CCCC
Execution
$ awk 'FNR==1{print $0,"Col8";next}function cp(){a[$7]=$0;b[$7]=$3;c[$7]=$6}{f=$7 in a; d[$7]++}!f{o[++i]=$7;cp();next}f && b[$7]>$3{cp();next}f && b[$7]==$3 && $6<c[$7]{cp();next}END{for(i=1; i in o; i++)print a[o[i]],(d[o[i]]>1?"multi":"unique")}' file
Output
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
blah blah 0 blah blah 0 AAAA multi
blah blah 1 blah blah 1 BBBB multi
blah blah 8 blah blah 3 CCCC unique