Shell script using a loop, while, grep and awk to look up a list of terms and their corresponding values

Time: 2013-04-19 09:57:47

Tags: shell unix awk grep

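I am trying to match a list of terms (Ensembl gene IDs) from a source file against the list of terms in a target rnaseq.gtf file. I want to print the matched (grep'd) Ensembl gene IDs and their corresponding RPKM1 and RPKM2 values to a separate output file.

The source_geneid.csv file looks like this: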

GO Genes ENSEMBLE Gene ID
AATF    ENSG00000108270
ADNP    ENSG00000101126

The target_rnaseq.gtf file:

chr17   gencodeV7   gene    35306175    35414170    0.669763    +   .   gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";
chr20   gencodeV7   gene    49505585    49547750    0.862675    -   .   gene_id "ENSG00000101126.8"; transcript_ids "ENST00000371602.2,ENST00000349014.3,ENST00000396029.3,ENST00000396032.1,ENST00000534467.1,"; RPKM1 "12.0082"; RPKM2 "8.55263"; iIDR "0.000";

The output file should contain each matched (grep'd) gene_id and its corresponding RPKM1 and RPKM2 values:

ENSG00000108270.5 RPKM1 "7.81399"  RPKM2 "8.149"
ENSG00000101126.8 RPKM1 "12.0082" RPKM2 "8.55263"

I did it on the command line:

grep -w "ENSG*" target_rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' >> output.txt

I also tried (thanks to fedorqui):

while read line
do
  var=$(echo $line | awk '{print $2}')
  grep -w "$var" target_rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.txt
done < source_geneid.csv

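But it prints every gene ID in the target file.

1 answer:

Answer 0: (score: 3)

target_rnaseq.gtf appears to be well-formed, so you can easily pre-process it to simplify the job. For example, extracting the values you are interested in is straightforward: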

$ awk 'NR>1{gsub(/ ?"/,"",$1);print $1,$3,$4}' FS=';' RS='gene_id' rnaseq
ENSG00000108270.5  RPKM1 "7.81399"  RPKM2 "8.149"
ENSG00000101126.8  RPKM1 "12.0082"  RPKM2 "8.55263"
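
The one-liner above relies on a multi-character RS, which gawk and mawk accept but POSIX leaves unspecified. If portability matters, one possible alternative is to keep the default record and field splitting and scan the attribute tokens for the keys of interest. The following is only a sketch under that assumption: the function name extract_rpkm and the temporary sample file are made up for illustration, and the sample lines are copied from the question.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): print gene_id, RPKM1
# and RPKM2 for every line of a GTF-like file using default awk splitting.
extract_rpkm() {
    awk '
    {
        id = ""; r1 = ""; r2 = ""
        # Attributes start at field 9 as alternating key / value tokens.
        for (i = 9; i < NF; i++) {
            if      ($i == "gene_id") id = $(i + 1)
            else if ($i == "RPKM1")   r1 = $(i + 1)
            else if ($i == "RPKM2")   r2 = $(i + 1)
        }
        gsub(/[";]/, "", id)   # strip quotes and semicolon from the ID
        sub(/;$/, "", r1)      # keep the quotes, drop the trailing semicolon
        sub(/;$/, "", r2)
        if (id != "") print id, "RPKM1", r1, "RPKM2", r2
    }' "$1"
}

# Sample input: the two lines shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
chr17   gencodeV7   gene    35306175    35414170    0.669763    +   .   gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";
chr20   gencodeV7   gene    49505585    49547750    0.862675    -   .   gene_id "ENSG00000101126.8"; transcript_ids "ENST00000371602.2,ENST00000349014.3,ENST00000396029.3,ENST00000396032.1,ENST00000534467.1,"; RPKM1 "12.0082"; RPKM2 "8.55263"; iIDR "0.000";
EOF

result=$(extract_rpkm "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

Because the keys are looked up by name rather than by position, this variant should also tolerate lines where the attributes appear in a different order, at the cost of a slightly longer script.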

Parsing source_geneid.csv is trivial:

$ awk 'NR>1{print $2}' geneid 
ENSG00000108270
ENSG00000101126

Putting it all together:

$ grep -f <(awk 'NR>1{print $2}' geneid) <(awk 'NR>1{gsub(/ ?"/,"",$1);print $1,$3,$4}' FS=';' RS='gene_id' rnaseq)
ENSG00000108270.5  RPKM1 "7.81399"  RPKM2 "8.149"
ENSG00000101126.8  RPKM1 "12.0082"  RPKM2 "8.55263"
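
The grep -f combination above needs a shell with process substitution (such as bash) and matches each ID as a substring pattern. As a possible alternative, the same lookup can be sketched in a single awk pass that loads the wanted IDs into an array and compares them exactly against the gene_id with its version suffix removed. This is an illustration under stated assumptions, not the answerer's method: the function name match_genes is hypothetical, and the ID list is cut down to one gene from the question so that the filtering is visible.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer).
# $1 = ID list with a header row and the ID in column 2
# $2 = GTF-like file with gene_id / RPKM1 / RPKM2 attributes
match_genes() {
    awk '
    NR == FNR {                       # first file: remember the wanted IDs
        if (FNR > 1) wanted[$2] = 1   # skip the header row
        next
    }
    {
        id = ""; r1 = ""; r2 = ""
        for (i = 9; i < NF; i++) {
            if      ($i == "gene_id") id = $(i + 1)
            else if ($i == "RPKM1")   r1 = $(i + 1)
            else if ($i == "RPKM2")   r2 = $(i + 1)
        }
        gsub(/[";]/, "", id)
        sub(/;$/, "", r1)
        sub(/;$/, "", r2)
        base = id
        sub(/\.[0-9]+$/, "", base)    # ENSG00000108270.5 becomes ENSG00000108270
        if (base in wanted) print id, "RPKM1", r1, "RPKM2", r2
    }' "$1" "$2"
}

# Sample input taken from the question; the ID list keeps only AATF.
ids=$(mktemp)
gtf=$(mktemp)
cat > "$ids" <<'EOF'
GO Genes ENSEMBLE Gene ID
AATF    ENSG00000108270
EOF
cat > "$gtf" <<'EOF'
chr17   gencodeV7   gene    35306175    35414170    0.669763    +   .   gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";
chr20   gencodeV7   gene    49505585    49547750    0.862675    -   .   gene_id "ENSG00000101126.8"; transcript_ids "ENST00000371602.2,ENST00000349014.3,ENST00000396029.3,ENST00000396032.1,ENST00000534467.1,"; RPKM1 "12.0082"; RPKM2 "8.55263"; iIDR "0.000";
EOF

result=$(match_genes "$ids" "$gtf")
printf '%s\n' "$result"
rm -f "$ids" "$gtf"
```

With the real files this would be called as match_genes source_geneid.csv target_rnaseq.gtf > output.txt. The sketch assumes the ID list is non-empty and whitespace-separated, as in the sample shown in the question.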