I am trying to remove lines in a comma-separated file where the APPID is the same and the "Category" column holds the same category. Input:
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
Desired output:
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
"APP-1" is removed because its column 2 values are identical and its "Category" column is "Cell" in both lines.
"APP-2" is kept because one of its "Category" values is "Cell" while the other is "Enzyme". The same scheme applies to "APP-3", whose "Category" column contains heterogeneous categories.
(Update) "APP-4" is kept because its "Category" column contains heterogeneous categories. We want to keep the repeated "5002 , APP-4 ..." lines; they will be handled in the next script. This step is a fast way to remove tens of thousands of data points whose "Category" values are homogeneous (when their APPIDs are identical), so that the arrays in the next script do not blow up.
My attempt so far does not seem to work (cf. removal of redundant lines based on value in last column):
awk -F " ," '!a[$1,$2,$3,$4,$5,$6,$7,$8,$9]++' input
The files to process are about a million lines each, and there are about 400 files in total, so execution speed seems critical here. Can any gurus enlighten me? Thanks!
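The intended filter can be sketched as a two-pass pure-Python program: first count the distinct Category values per APPID, then keep only the rows whose APPID has more than one. This is a minimal illustrative sketch (not from the question itself); it assumes the Category value is the second-to-last split field because of the trailing comma, and `filter_heterogeneous` is just an illustrative name:

```python
SAMPLE = """\
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-1 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
"""

def filter_heterogeneous(lines):
    header, *rows = lines
    cats = {}  # APPID -> set of distinct Category values
    for row in rows:
        cols = [c.strip() for c in row.split(',')]
        # the trailing comma yields an empty last field, so Category is cols[-2]
        cats.setdefault(cols[1], set()).add(cols[-2])
    kept = [header]
    for row in rows:
        cols = [c.strip() for c in row.split(',')]
        if len(cats[cols[1]]) > 1:  # heterogeneous categories -> keep the group
            kept.append(row)
    return kept

out = filter_heterogeneous(SAMPLE.splitlines())
```

Per the updated requirement, this keeps all four APP-4 lines and drops both APP-1 lines.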
Answer 0 (score: 2)
def killDups(infilepath, outfilepath):
    data = {}
    with open(infilepath) as infile:
        infile.readline()                      # skip the header
        for i, line in enumerate(infile):
            line = line.strip()
            cols = [col.strip() for col in line.split(',')]
            # the trailing comma leaves an empty last field, so Category is cols[-2]
            appid, cat = cols[1], cols[-2]
            if appid not in data:
                data[appid] = {cat: i}
            elif cat in data[appid]:
                data[appid].pop(cat)           # category repeated -> discard it
            else:
                data[appid][cat] = i
    whitelist = set()
    for k, v in data.items():
        whitelist.update(v.values())
    with open(infilepath) as infile, open(outfilepath, 'w') as outfile:
        outfile.write(infile.readline())       # copy the header through
        for i, line in enumerate(infile):
            if i in whitelist:
                outfile.write(line)
Answer 1 (score: 2)
$ awk -F, '
{ key=$2 FS $(NF-1); nr2key[NR]=key; key2val[key]=$0; cnt[key]++ }
END {
    for (i=1; i<=NR; i++) {
        key = nr2key[i]
        if (cnt[key] == 1) {
            print key2val[key]
        }
    }
}
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
Answer 2 (score: 1)
Here is another way with awk:
awk -F, '
!patt[$2,$(NF-1)]++ { lines[$2,$(NF-1)] = $0 }
END {
for (line in lines)
if (patt[line] == 1)
print lines[line]
}' file | sort -t, -nk1,2
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-3 ,,,,,,,, Cell ,
If the (APPID, Category) pair has not been seen before in the patt array, the whole line is assigned to the lines array under the same key. The END block then iterates over the lines array: if the key's count in the patt array is 1, that line is printed.
Note: for a more elegant way using vanilla awk, see Ed Morton's solution.
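The logic just described (keep a line only when its (APPID, Category) pair occurs exactly once) can be mirrored in Python with a Counter. This is an illustrative sketch only; `rows` and `key` are assumed names, and Category is taken as the second-to-last field because of the trailing comma:

```python
from collections import Counter

rows = [
    "5002 , APP-1 ,,,,,,,, Cell ,",
    "5002 , APP-1 ,,,,,,,, Cell ,",
    "5002 , APP-2 ,,,,,,,, Cell ,",
    "5002 , APP-2 ,,,,,,,, Enzyme ,",
]

def key(row):
    cols = [c.strip() for c in row.split(',')]
    return (cols[1], cols[-2])  # (APPID, Category)

counts = Counter(key(r) for r in rows)
# keep only lines whose (APPID, Category) pair is unique
unique = [r for r in rows if counts[key(r)] == 1]
```

Both APP-2 lines survive and both APP-1 duplicates are dropped, matching the awk output above.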
If you have GNU awk, then (similar logic, but using its built-in array-ordering facility):
gawk -F, '
BEGIN { PROCINFO["sorted_in"] = "@ind_num_desc" }
!patt[$2,$(NF-1)]++ {
lines[$2,$(NF-1)] = $0
}
END {
for (line in lines)
if (patt[line] == 1)
print lines[line]
}' file
If you can use perl:
perl -F, -lane'
print and next if $.==1; # print the header
$key = "@F[1,-1]"; # form the key using two columns
$h{$key} or push @rec, $key; # if key is not in hash push to array (for order)
push @{$h{$key}}, $_ # create hash of arrays
}{ # In the END block ...
print @{$h{$_}} for grep { @{$h{$_}} == 1 } @rec # print line whose array count is 1
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
To also keep the duplicated "APP-4" lines per the update:
perl -F, -lane'
print and next if $.==1;
$seen{$F[1],$F[-1]}++ or push @rec, [$F[1], $F[-1]];
push @{$h{$F[1]}{$F[-1]}}, $_
}{
for (@rec) {
next if keys %{$h{$_->[0]}} == 1;
print join "\n", @{$h{$_->[0]}{$_->[1]}};
}
' file
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
Answer 3 (score: 1)
Here is a GNU Awk solution that keeps keys whose values are heterogeneous overall, even when they contain duplicates, such as those in APP-4:
BEGIN {
    FS = ","
    OFS = ","
}
{
    key[NR] = $2
    count[$2]++
    v = $(NF-1)
    val[NR] = v
    val_count[$2][v]++
    line[NR] = $0
}
END {
    for (i=1; i<=NR; i++) {
        k = key[i]
        j = val[i]
        if (count[k] > 1) {
            if (val_count[k][j] == count[k]) {
                continue
            } else {
                print line[i]
            }
        } else {
            print line[i]
        }
    }
}
You can save this as an Awk file, name it hetero.awk, and run the script from the shell like this:
gawk -f hetero.awk file
Output:
1,APPID,ID2,ID3,5,6,7,8,9,Category,
5002 , APP-2 ,,,,,,,, Cell ,
5002 , APP-2 ,,,,,,,, Enzyme ,
5002 , APP-3 ,,,,,,,, Cell ,
5002 , APP-3 ,,,,,,,, Biochemical ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Enzyme ,
5002 , APP-4 ,,,,,,,, Cell ,
Alternatively, for a dirtier approach, you can put the following in a shell script:
gawk -F, -v OFS=, '{
    key[NR] = $2
    count[$2]++
    v = $(NF-1)
    val[NR] = v
    val_count[$2][v]++
    line[NR] = $0
} END {
    for (i=1; i<=NR; i++) {
        k = key[i]
        j = val[i]
        if (count[k] > 1) {
            if (val_count[k][j] == count[k]) {
                continue
            } else {
                print line[i]
            }
        } else {
            print line[i]
        }
    }
}' file
As a general practice, I prefer to use only Awk one-liners in my bash scripts.
Note that this uses arrays of arrays (true multidimensional arrays), a feature that is not available in other awk variants such as mawk.
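The array-of-arrays bookkeeping in this answer maps directly onto nested dictionaries. Here is an illustrative Python sketch of the same keep/drop condition (assumed names, with Category taken as the second-to-last field because of the trailing comma):

```python
from collections import defaultdict

rows = [
    "5002 , APP-1 ,,,,,,,, Cell ,",
    "5002 , APP-1 ,,,,,,,, Cell ,",
    "5002 , APP-4 ,,,,,,,, Enzyme ,",
    "5002 , APP-4 ,,,,,,,, Enzyme ,",
    "5002 , APP-4 ,,,,,,,, Enzyme ,",
    "5002 , APP-4 ,,,,,,,, Cell ,",
]

count = defaultdict(int)                           # lines per APPID
val_count = defaultdict(lambda: defaultdict(int))  # lines per (APPID, Category)
parsed = []
for r in rows:
    cols = [c.strip() for c in r.split(',')]
    k, v = cols[1], cols[-2]
    count[k] += 1
    val_count[k][v] += 1
    parsed.append((k, v, r))

# drop a line only when every line of its APPID shares its Category
kept = [r for k, v, r in parsed
        if not (count[k] > 1 and val_count[k][v] == count[k])]
```

As in the gawk version, the homogeneous APP-1 pair is dropped while all four heterogeneous APP-4 lines survive.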