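I am trying to match a list of terms (Ensembl gene IDs) from a source file against a target rnaseq.gtf file. I want to print each matched (grep'd) Ensembl gene ID, together with its corresponding RPKM1 and RPKM2 values, to a separate output file.
The source_geneid.csv file looks like this: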
GO Genes ENSEMBLE Gene ID
AATF ENSG00000108270
ADNP ENSG00000101126
The target_rnaseq.gtf file:
chr17 gencodeV7 gene 35306175 35414170 0.669763 + . gene_id "ENSG00000108270.5"; transcript_ids "ENST00000225402.4,"; RPKM1 "7.81399"; RPKM2 "8.149"; iIDR "0.000";
chr20 gencodeV7 gene 49505585 49547750 0.862675 - . gene_id "ENSG00000101126.8"; transcript_ids "ENST00000371602.2,ENST00000349014.3,ENST00000396029.3,ENST00000396032.1,ENST00000534467.1,"; RPKM1 "12.0082"; RPKM2 "8.55263"; iIDR "0.000";
The output file should contain each matched (grep'd) gene_id and its corresponding RPKM1 and RPKM2 values:
ENSG00000108270.5 RPKM1 "7.81399" RPKM2 "8.149"
ENSG00000101126.8 RPKM1 "12.0082" RPKM2 "8.55263"
I tried this on the command line:
grep -w "ENSG*" target_rnaseq.gtf| awk '{print $10,$13,$14,$15,$16}' >> output.txt
I also tried this (thanks to fedorqui):
while read line
do
var=$(echo $line | awk '{print $2}')
grep -w "$var" target_rnaseq.gtf | awk '{print $10,$13,$14,$15,$16}' >> output.txt
done < source_geneid.csv
But it prints all the gene IDs in the target file.
Answer 0 (score: 3)
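target_rnaseq.gtf appears to be consistently formatted, so you can preprocess it to simplify the job. For example, extracting the values you are interested in is straightforward (the file is named rnaseq in the commands below):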
$ awk 'NR>1{gsub(/ ?"/,"",$1);print $1,$3,$4}' FS=';' RS='gene_id' rnaseq
ENSG00000108270.5 RPKM1 "7.81399" RPKM2 "8.149"
ENSG00000101126.8 RPKM1 "12.0082" RPKM2 "8.55263"
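The command above relies on a multi-character RS. POSIX only defines the behavior for a single-character RS, while gawk and mawk treat a longer RS as a regular expression, so the one-liner may not behave the same in every awk. As a minimal sketch of an alternative, assuming every attribute is written as key "value"; exactly as in the sample lines, the values can be looked up by attribute name with the default newline record separator. The function name extract_rpkm is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Sketch: print gene_id, RPKM1 and RPKM2 for every line of a GTF file.
# extract_rpkm is a hypothetical helper name; $1 is the GTF file.
extract_rpkm() {
    awk '{
        id = r1 = r2 = ""
        # Attributes look like: key "value";  so the value is the field after the key.
        for (i = 1; i < NF; i++) {
            if ($i == "gene_id")    id = $(i + 1)
            else if ($i == "RPKM1") r1 = $(i + 1)
            else if ($i == "RPKM2") r2 = $(i + 1)
        }
        gsub(/[";]/, "", id)   # print the gene ID bare
        gsub(/;/, "", r1)      # keep the quotes, drop the trailing semicolon
        gsub(/;/, "", r2)
        if (id != "") print id, "RPKM1", r1, "RPKM2", r2
    }' "$1"
}
```

Calling extract_rpkm rnaseq should give the same three columns per gene as the RS-based command, without depending on where the attributes sit in the line.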
Parsing source_geneid.csv (named geneid below) is trivial:
$ awk 'NR>1{print $2}' geneid
ENSG00000108270
ENSG00000101126
Putting it all together:
$ grep -f <(awk 'NR>1{print $2}' geneid) <(awk 'NR>1{gsub(/ ?"/,"",$1);print $1,$3,$4}' FS=';' RS='gene_id' rnaseq)
ENSG00000108270.5 RPKM1 "7.81399" RPKM2 "8.149"
ENSG00000101126.8 RPKM1 "12.0082" RPKM2 "8.55263"
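Note that grep -f treats each ID as a regular expression and matches it anywhere in the line. If you would rather compare IDs exactly, the lookup can also be done in a single awk pass over both files. This is only a sketch under some assumptions: the ID list is whitespace-separated with the ID in column 2 and one header line, the GTF lines have the same field layout as the sample (gene_id value in field 10, RPKM values in fields 14 and 16, the same positions the question prints), and the version suffix such as .5 is ignored when matching. The function name match_genes is hypothetical.

```shell
#!/bin/sh
# Sketch: print gene_id, RPKM1 and RPKM2 for genes listed in the ID file.
# match_genes is a hypothetical helper name; $1 = gene ID list, $2 = GTF file.
match_genes() {
    awk '
        # First file: remember every ID in column 2, skipping the header line.
        NR == FNR { if (FNR > 1) wanted[$2] = 1; next }
        # Second file: fields are assumed to sit where they do in the sample.
        {
            id = $10; r1 = $14; r2 = $16
            gsub(/[";]/, "", id)        # "ENSG00000108270.5"; -> ENSG00000108270.5
            sub(/;$/, "", r1)           # drop the trailing semicolon, keep the quotes
            sub(/;$/, "", r2)
            base = id
            sub(/\.[0-9]+$/, "", base)  # strip the version suffix before the lookup
            if (base in wanted) print id, $13, r1, $15, r2
        }
    ' "$1" "$2"
}
```

For example, match_genes geneid rnaseq > output.txt writes one line per gene that appears in both files, in the order of the GTF file.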