We have the following input:
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
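We want to split this input.csv into two files.
If rows share the same $2 and the max minus the min of $17 is <= 1, average $17 and write the result to file "a".
If rows share the same $2 and the max minus the min of $17 is > 1, average $17 and write the result to file "b".
Note: if a $2 value occurs only once, we still want to keep that row (cpd-6666666, for example).
Note: for cpd-1111, ($17 max - min) = -1 - (-1.3) = 0.3 < 1.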
a: rows where ($17 max - min) <= 1. The new $17 for cpd-1111 ($2) is the average of (-1, -1.1, -1.2, -1.3) = -1.15.
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.15,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
b: rows where ($17 max - min) > 1. The new $17 for cpd-7788990 ($2) is the average of (-1, -2, -3) = -2.
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
Here is my attempt. It splits the input into a and b, but it does not compute the average yet.
#!/usr/bin/awk -f
BEGIN {FS=","; f1="a"; f2="b"}
FNR==1 { print $0 > f1; print $0 > f2; next }
$2!=last_id && FNR > 2 { handleBlock() }
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }
END { handleBlock() }
function handleBlock() {
    if( m[1]-m[cnt]<=1 ) fname = f1
    else fname = f2
    for( i=1;i<=cnt;i++ ) { print a[i] > fname }
    cnt=0
}
How can I also get the averaged value into a and b? Thanks.
Answer 0 (score: 1)
You can get the average into the output files by changing handleBlock() as follows:
function handleBlock() {
    # note: this treats m[1] as the max and m[cnt] as the min, which only
    # holds if each group is sorted by $17 in descending order, as in the sample
    if( m[1]-m[cnt]<=1 ) fname = f1
    else fname = f2
    # compute the sum of the $17 fields for the group
    for( i=1;i<=cnt;i++ ) { sum+=m[i] }
    # compute the average
    avg = cnt > 0 ? sum/cnt : sum
    # use the max line for the output, split into an output array: oarr
    fcnt = split( a[1], oarr )
    # modify the 17th field of the output array
    oarr[17]=avg
    # write the updated array to the desired file one field at a time
    for( i=1;i<=fcnt;i++ ) {
        printf( "%s%s", oarr[i], i==fcnt ? "\n" : FS ) > fname
    }
    cnt=0; sum=0
}
See here for comments on the original script.
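
If the rows within a group are not guaranteed to be sorted by $17, the m[1]-m[cnt] test in handleBlock() may not give max minus min. Below is a minimal sketch of a variant that tracks the min and max explicitly. It still assumes that rows sharing the same $2 are adjacent, and it keeps the first row of each group as the template for the output line. The function name emit_group and the variable names are hypothetical, chosen only for this illustration; the sample data is copied from the question.

```sh
#!/bin/sh
# Write the sample input from the question to input.csv.
cat > input.csv <<'EOF'
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF

awk -F, '
# Write one averaged row for the group collected so far, then reset the state.
function emit_group(   i, n, f, out, fname) {
    if (cnt == 0) return
    # spread <= 1 goes to file "a", anything larger to file "b"
    fname = (max - min <= 1) ? "a" : "b"
    # reuse the first row of the group and overwrite its 17th field
    n = split(first, f, ",")
    f[17] = sum / cnt
    out = f[1]
    for (i = 2; i <= n; i++) out = out "," f[i]
    print out > fname
    cnt = 0; sum = 0
}
# copy the header line to both output files
FNR == 1 { print > "a"; print > "b"; next }
# a new $2 value closes the previous group
$2 != last { emit_group() }
{
    v = $17 + 0                     # force numeric interpretation
    if (cnt == 0) { first = $0; min = max = v }
    if (v < min) min = v
    if (v > max) max = v
    sum += v; cnt++; last = $2
}
END { emit_group() }
' input.csv
```

With the sample input, file a should end up with the header plus the cpd-6666666 and cpd-1111 rows, and file b with the header plus the cpd-7788990 row. A $2 value that occurs only once has a spread of 0, so it lands in a with its $17 unchanged.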