Suppose we have an input file, input.csv:
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
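We would like to split this input.csv into 2 files, so that we can carry out the following step: "if $2 is the same, take the average over the matching rows whose max minus min of $17 is <= 1".
"If $2 is the same and max minus min of $17 is <= 1", put the rows into one file.
Note: if a $2 value is unique by itself, we want to keep it here as well (cpd-6666666, for example).
Note: for cpd-1111, ($17 max - min) = -1 - (-1.3) = 0.3 < 1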
outputfile1.csv
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
"If $2 is the same and max minus min of $17 is > 1", put the rows into the other file.
outfile2.csv (where max minus min of $17 = -1 - (-3) = 2 > 1)
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
Below is my attempt, modified from the following link:
#!/usr/bin/awk -f
BEGIN { FS="," }
NR==1 {print; next}
{
    a[$2,$17]=$0
    h=high[$2]
    high[$2]=$17>h || h=="" ? $17 : h
    m=mid[$2]
    mid[$2]=l<$17<h || m=="" ? $17 : m
    l=low[$2]
    low[$2]=$17<l || l=="" ? $17 : l
}
END {
    for(i in high) {
        if(high[i]-low[i]<=1) {
            print a[i,high[i]]
            print a[i,mid[i]]
            print a[i,low[i]]
        }
    }
}
Output:
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
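For some unknown reason, this script does not print the middle-range values correctly. Does any guru have comments or a solution?
Answer 0 (score: 2)
Take a look at this. It is an example of processing each group as its ID changes: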
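#!/usr/bin/awk -f
# Split rows into two files by the $17 range of each $2 group.
# Relies on rows with the same $2 being adjacent and $17 descending within a group.
BEGIN {FS=","; f1="a"; f2="b"}
FNR==1 { print $0 > f1; print $0 > f2; next }   # header goes to both files
$2!=last_id && FNR > 2 { handleBlock() }        # ID changed: flush the previous group
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }         # buffer the row and its $17
END { handleBlock() }                           # flush the last group
function handleBlock() {
    if( m[1]-m[cnt]<=1 ) fname = f1             # max is m[1], min is m[cnt]
    else fname = f2
    for( i=1;i<=cnt;i++ ) { print a[i] > fname }
    cnt=0
}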
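It is an executable awk file. When placed in a file named awko and made executable with chmod +x awko, it can be run as awko data for an input file named "data" (or ./awko data if the current directory is not on your PATH).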
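The script I wrote for the other question was based on my assumption that the input order of the file's elements was unknown: the $2 fields could appear in any order, and only the min and max values mattered. In this question, the OP wants to send all the rows associated with a $2 field to one file or the other, based on the min/max values.
The input file for this question has the following properties that this script depends on:
    the $2 fields are grouped
If a list of resources is sorted by resource ID, a common algorithm for minimally loading data is to load it only when the resource ID changes. The same can be done here for processing grouped entries. As an example: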
a
a
a
b <- this is a good place to process all the prior "a" entries
b
c <- process "b" entries here
c
EOF <- the end of the file. process the last group ( the "c" entries here )
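The group-change idea above can be sketched as a tiny awk program, wrapped in a shell snippet so that it is self-contained. The keys a, b and c mirror the example; the flush_group() helper and the groups variable are made-up names for illustration only.

```sh
#!/bin/sh
# Sketch: process each group at the moment the key in column 1 changes.
# flush_group() and the sample keys are illustrative, not part of the answer.
groups=$(printf 'a\na\na\nb\nb\nc\nc\n' | awk '
function flush_group() {            # report the finished group, then reset
    if (n > 0) print key ": " n " entries"
    n = 0
}
$1 != key { flush_group() }         # key changed: process the prior group
{ key = $1; n++ }                   # collect the current line
END { flush_group() }               # end of file: process the last group
')
echo "$groups"
```

Run as written, this should print "a: 3 entries", "b: 2 entries" and "c: 2 entries", one per line. The last group is only reported because of the END rule.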
With that in mind, here is a breakdown of the script:
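FS is set in the BEGIN block, along with some output file names ("a" and "b" for my testing) in f1 and f2.
When $2 != last_id, the handleBlock() function is called for processing.
Each line is stored in array a and its $17 in array m, then last_id=$2 is set (the array names are terrible).
The cnt variable indicates how many entries are in each group (which I've called a block).
handleBlock() is only called when the $2 ID changes or at the end of the file, to catch the last group in the END block.
handleBlock() tests the OP's condition using m (max is m[1] and min is m[cnt]) to determine the output file name, and then prints all elements from a to the chosen file name.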
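Taking m[1] as the max and m[cnt] as the min works for the sample input because $17 descends within every group. If that ordering cannot be assumed, one possible variation is to track the extremes explicitly while still relying on grouped $2 values. The following is only a sketch under that assumption: idcol, valcol, the temporary directory, the file names small.csv and large.csv, and the three-column demo data are all hypothetical. For the input.csv in the question one would pass idcol=2 and valcol=17.

```sh
#!/bin/sh
# Sketch only: split groups by (max - min) of a value column, tracking the
# extremes explicitly so row order inside a group does not matter.
# idcol/valcol, the temp directory and the file names are illustrative.
tmp=$(mktemp -d)
cat > "$tmp/in.csv" <<'EOF'
n,id,val
1,x,-3
2,x,-1
3,x,-2
4,y,-1
5,z,-1.3
6,z,-1
EOF

awk -F, -v idcol=2 -v valcol=3 -v f1="$tmp/small.csv" -v f2="$tmp/large.csv" '
function handle_block(   i, fname) {
    if (cnt == 0) return
    fname = (max - min <= 1) ? f1 : f2       # same test as in the question
    for (i = 1; i <= cnt; i++) print row[i] > fname
    cnt = 0
}
FNR == 1 { print > f1; print > f2; next }    # header goes to both files
cnt > 0 && $idcol != last_id { handle_block() }
{
    v = $valcol + 0                          # force a numeric comparison
    if (cnt == 0 || v > max) max = v
    if (cnt == 0 || v < min) min = v
    row[++cnt] = $0
    last_id = $idcol
}
END { handle_block() }
' "$tmp/in.csv"

small=$(cat "$tmp/small.csv")
large=$(cat "$tmp/large.csv")
rm -rf "$tmp"
printf 'range <= 1:\n%s\nrange > 1:\n%s\n' "$small" "$large"
```

With the demo data, group x (range 2) lands in the "range > 1" file even though its rows are not sorted, while y (a single row) and z (range 0.3) land in the "range <= 1" file.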