Question

We have this input:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

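We want to split this input.csv into 2 files.

If $2 is the same and the max minus the min of $17 is <= 1, average $17 and put the result in "file a".

If $2 is the same and the max minus the min of $17 is > 1, average $17 and put the result in "file b".

Note: if a $2 value appears only once, we want to keep that row as well (cpd-6666666, for example).

Note: for cpd-1111, ($17 max - min) = -1 - (-1.3) = 0.3 < 1

a: where ($17 max - min) <= 1. The new $17 for cpd-1111 ($2) is the average of (-1, -1.1, -1.2, -1.3) = -1.15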

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.15,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

b: where ($17 max - min) > 1. The new $17 for cpd-7788990 ($2) is the average of (-1, -2, -3) = -2

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

Here is an attempt that can split the input into a and b but does not yet compute the averages.

#!/usr/bin/awk -f

BEGIN {FS=","; f1="a"; f2="b"}

FNR==1 { print $0 > f1; print $0 > f2; next }

$2!=last_id && FNR > 2 { handleBlock() }

{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }

END { handleBlock() }

function handleBlock() {
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  for( i=1;i<=cnt;i++ ) { print a[i] > fname }
  cnt=0
}

May I know how to also get the averages for a and b? Thanks.

Answer 1

You can get the averages in the output files by changing handleBlock() as follows:

function handleBlock() {
  # m[1] is treated as the max and m[cnt] as the min, so this relies on
  # each group being sorted by $17 in descending order, as in the sample
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  # compute the sum of the $17 fields for the group
  for( i=1;i<=cnt;i++ ) { sum+=m[i] }
  # compute the average
  avg = cnt > 0 ? sum/cnt : sum
  # use the first (max) line for the output, split into an output array: oarr
  fcnt = split( a[1], oarr )
  # modify the 17th field of the output array
  oarr[17]=avg
  # write the updated array to the desired file one field at a time
  for( i=1;i<=fcnt;i++ ) {
    printf( "%s%s", oarr[i], i==fcnt ? "\n" : FS ) > fname
  }
  # reset the per-group state for the next block
  cnt=0; sum=0
}

See here for comments on the original script.
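
The handleBlock() above reads the group's maximum from m[1] and its minimum from m[cnt], which is only right when the rows of each group are already sorted by $17 in descending order, as they are in the sample. A minimal sketch of a variant that tracks the real minimum and maximum while reading is shown below, wrapped in a shell script so it can be run as-is. It assumes rows sharing a $2 are contiguous, as the original script does. The variable names min, max, first and out are my own, and the sample rows are copied from the question.

```shell
#!/bin/sh
# Sample data: header and rows copied from the question's input.csv.
cat > input.csv <<'EOF'
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF

awk '
BEGIN { FS = ","; f1 = "a"; f2 = "b" }

# copy the header line to both output files
FNR == 1 { print > f1; print > f2; next }

# a change in $2 closes the previous group (groups must be contiguous)
cnt > 0 && $2 != last_id { handleBlock() }

{
    # track min, max and sum of $17 for the current group
    if (cnt == 0 || $17 + 0 > max) max = $17 + 0
    if (cnt == 0 || $17 + 0 < min) min = $17 + 0
    sum += $17
    if (++cnt == 1) first = $0    # first row of the group is the output template
    last_id = $2
}

END { if (cnt > 0) handleBlock() }

function handleBlock(   n, i, out) {
    # a range of at most 1 goes to file a, anything larger to file b
    fname = (max - min <= 1) ? f1 : f2
    n = split(first, out, FS)
    out[17] = sum / cnt           # replace $17 with the group average
    for (i = 1; i <= n; i++)
        printf "%s%s", out[i], (i == n ? "\n" : FS) > fname
    cnt = 0; sum = 0
}
' input.csv
```

Run in an empty directory, this should write files a and b with one averaged row per cpd_number. Because min and max are updated on every row, the order of rows inside a group no longer matters, and nothing but the first row of each group has to be held in memory.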
