Question

Suppose we have an input file, input.csv:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

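We want to split input.csv into two files so that we can then carry out this step: "if $2 is the same, average the matching rows where max minus min of $17 is <= 1".

"If $2 is the same and max minus min of $17 is <= 1", put the rows in one file.

Note: if a $2 value is unique on its own, we want to keep it here as well (cpd-6666666, for example).
Note: for cpd-1111, ($17 max - min) = -1 - (-1.3) = 0.3 < 1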

outputfile1.csv

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

"If $2 is the same and max minus min of $17 is > 1", put the rows in the other file.

outfile2.csv (where max - min in $17 = -1 - (-3) = 2 > 1)

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

Below is an attempt, modified from the following linked question:

awk/bash remove lines with an unique id and keep the lines that has the max/min value in a column under the same ID

#!/usr/bin/awk -f

BEGIN { FS="," }

NR==1 {print; next}

{
  a[$2,$17]=$0

  h=high[$2]
  high[$2]=$17>h || h=="" ? $17 : h

  m=mid[$2]
  mid[$2]=l<$17<h || m=="" ? $17 : m

  l=low[$2]
  low[$2]=$17<l || l=="" ? $17 : l
}

END {
  for(i in high) {
    if(high[i]-low[i]<=1) {
      print a[i,high[i]]
      print a[i,mid[i]]
      print a[i,low[i]]
    }
  }
}

Output:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

For some unknown reason, this script does not print the mid-range values correctly. Does any guru have a comment or a solution?

Answer 1

Take a look at this. It is an example of processing each group as its ID changes:

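#!/usr/bin/awk -f

# f1 and f2 are the two output file names
BEGIN {FS=","; f1="a"; f2="b"}

# header line: copy it to both output files
FNR==1 { print $0 > f1; print $0 > f2; next }

# $2 changed: write out the previous group
$2!=last_id && FNR > 2 { handleBlock() }

# store the line, its $17 value, and the current ID
{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }

# write out the last group
END { handleBlock() }

# assumes the max is m[1] and the min is m[cnt]
function handleBlock() {
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  for( i=1;i<=cnt;i++ ) { print a[i] > fname }
  cnt=0
}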

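It is an executable awk file. Put it in a file named awko and run chmod +x awko; then, for an input file named "data", it can be run as awko data (or ./awko data if the current directory is not on your PATH).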

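The script I wrote for the other question was based on my assumption that the input order of the file's elements was unknown: the $2 fields could be in any order, and only the min and max values mattered. In this question, the OP wants all of the rows related to a $2 field sent to one file or the other based on the min/max values.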

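The input file for this question has the following properties, which this script depends on:

The header is on the first line
The $2 fields are grouped
The max value is the first element of the group
The min value is the last element of the group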
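
The last two properties can be relaxed by tracking the running max and min of $17 within each group instead of reading them from the first and last rows. The following is a minimal sketch of that variation, not the awko script itself. It still assumes that the header is on line 1 and that the rows are grouped by $2. The names row, max, min and flush are illustrative; the data and the output file names are the ones from the question.

```shell
#!/bin/sh
# Recreate input.csv from the question.
cat > input.csv <<'EOF'
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF

awk -F, '
  # Header: copy to both output files.
  FNR == 1 { print > "outputfile1.csv"; print > "outfile2.csv"; next }

  # $2 changed and a group is pending: write it out.
  $2 != last_id && cnt > 0 { flush() }

  # Keep the row and update the running max and min of $17.
  {
    row[++cnt] = $0
    v = $17 + 0
    if (cnt == 1 || v > max) max = v
    if (cnt == 1 || v < min) min = v
    last_id = $2
  }

  # End of input: write out the last group.
  END { if (cnt > 0) flush() }

  # Choose the file by the range of $17, then print the whole group.
  function flush(   i, out) {
    out = (max - min <= 1) ? "outputfile1.csv" : "outfile2.csv"
    for (i = 1; i <= cnt; i++) print row[i] > out
    cnt = 0
  }
' input.csv
```

Run against the question's sample, this leaves the cpd-6666666 row and the four cpd-1111 rows in outputfile1.csv, and the three cpd-7788990 rows in outfile2.csv.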

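If a list of resources is sorted by resource ID, a common algorithm for loading data minimally is to load it only when the resource ID changes. The same can be done here for processing the grouped entries. For example: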

a
a
a
b <- this is a good place to process all the prior "a" entries
b
c <- process "b" entries here
c
EOF <- the end of the file.  process the last group ( the "c" entries here )
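
A minimal awk sketch of this pattern, fed with the a/b/c example above and simply counting the entries in each group. The names last, n and flush are illustrative only.

```shell
#!/bin/sh
# One key per line, already grouped, as in the a/b/c example.
printf '%s\n' a a a b b c c |
awk '
  # Key changed (and this is not the very first line): process the previous group.
  NR > 1 && $1 != last { flush() }

  # Every line: count it and remember its key.
  { n++; last = $1 }

  # End of input: process the final group.
  END { if (n > 0) flush() }

  # Report the group that just ended and reset the counter.
  function flush() { print last, n; n = 0 }
' > groups.txt

cat groups.txt
```

This prints one line per group with its size: a 3, b 2, c 2.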

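With that in mind, here is a breakdown of the awko script:

Set FS and some output file names in the BEGIN block ("a" and "b" for my testing).
The first line is the header; put it in each of the files f1 and f2.
If $2 != last_id (and we are past the first data line), call the handleBlock() function to process the finished group.
Store the whole line in array a and $17 in array m, then set last_id=$2 (the array names are terrible).
The cnt variable is how many entries there are in each group (what I am calling a block).
handleBlock() is only called when the $2 ID changes, or at the end of the file, where the END block catches the last group.
handleBlock() tests the OP's condition using m (max is m[1] and min is m[cnt]) to determine the output file name, and then prints all elements of a to the chosen file.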
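
If the input cannot be assumed to be grouped by $2 at all, one alternative is to read the file twice: the first pass records the min and max per key, and the second pass routes each row. This is a sketch under assumptions, not part of the awko script. The sample data, the column parameters k and v, and the output names narrow.csv and wide.csv are made up for illustration; for the question's input.csv, k would be 2 and v would be 17.

```shell
#!/bin/sh
# Hypothetical ungrouped sample: id in column 1, value in column 2.
cat > ungrouped.csv <<'EOF'
id,value
x,-1
y,-1
x,-3
y,-1.3
z,-1
x,-2
EOF

# k = key column, v = value column. The same file is passed twice.
awk -F, -v k=1 -v v=2 '
  # Header: skip it on pass 1, copy it to both outputs on pass 2.
  FNR == 1 {
    if (NR != FNR) { print > "narrow.csv"; print > "wide.csv" }
    next
  }

  # Pass 1 (NR == FNR only while reading the first copy): min and max per key.
  NR == FNR {
    val = $v + 0
    if (!($k in max) || val > max[$k]) max[$k] = val
    if (!($k in min) || val < min[$k]) min[$k] = val
    next
  }

  # Pass 2: route each row by the range of its key.
  { print > ((max[$k] - min[$k] <= 1) ? "narrow.csv" : "wide.csv") }
' ungrouped.csv ungrouped.csv
```

Here the y and z rows end up in narrow.csv (ranges 0.3 and 0), and the x rows end up in wide.csv (range 2), in their original input order. The cost is reading the input twice, which is why the grouped, single-pass version is preferable when the grouping can be relied on.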

