bash: collecting information onto a single line

Date: 2019-06-03 21:06:28

Tags: bash awk grep

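From a file created by an earlier script, I would like to retrieve my information in a particular layout. Specifically, I want one line per rsID containing the rsID (unique), the gene name (unique), and the list of transcript names.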

Here is part of my rsID.txt file:

rsID
rs142849724
rs141989890

Here is part of my rsID_out.txt:

"1","rs142849724","ENSG00000228794","ENST00000624927"
"2","rs142849724","ENSG00000228794","ENST00000623808"
"3","rs142849724","ENSG00000228794","ENST00000445118"
"4","rs142849724","ENSG00000228794","ENST00000448975"
"5","rs142849724","ENSG00000228794","ENST00000610067"
"6","rs142849724","ENSG00000228794","ENST00000608189"
"7","rs142849724","ENSG00000228794","ENST00000609139"
"8","rs142849724","ENSG00000228794","ENST00000449005"
"9","rs142849724","ENSG00000228794","ENST00000416570"
"10","rs142849724","ENSG00000228794","ENST00000623070"
"11","rs142849724","ENSG00000228794","ENST00000609009"
"12","rs142849724","ENSG00000228794","ENST00000622921"
"13","rs141989890","ENSG00000228794","ENST00000624927"
"14","rs141989890","ENSG00000228794","ENST00000623808"
"15","rs141989890","ENSG00000228794","ENST00000445118"
"16","rs141989890","ENSG00000228794","ENST00000448975"
"17","rs141989890","ENSG00000228794","ENST00000610067"
"18","rs141989890","ENSG00000228794","ENST00000608189"
"19","rs141989890","ENSG00000228794","ENST00000609139"
"20","rs141989890","ENSG00000228794","ENST00000449005"
"21","rs141989890","ENSG00000228794","ENST00000416570"
"22","rs141989890","ENSG00000228794","ENST00000623070"
"23","rs141989890","ENSG00000228794","ENST00000609009"
"24","rs141989890","ENSG00000228794","ENST00000622921"

I wrote the following code:

while read line
do
    res=`grep "$line" rsID_out.txt | awk -F ',' '!seen[$3]++ {print $3 ";"}'`
    ra=`grep "$line" rsID_out.txt | awk -F ',' '{print $4}'`
    echo "$line ; $res ; $ra"
done < rsID.txt

This is the result I get:

rs142849724 ; "ENSG00000228794" ; "ENST00000624927"
"ENST00000623808"
"ENST00000445118"
"ENST00000448975"
"ENST00000610067"
"ENST00000608189"
"ENST00000609139"
"ENST00000449005"
"ENST00000416570"
"ENST00000623070"
"ENST00000609009"
"ENST00000622921"

rs141989890 ; "ENSG00000228794" ; "ENST00000624927"
"ENST00000623808"
"ENST00000445118"
"ENST00000448975"
"ENST00000610067"
"ENST00000608189"
"ENST00000609139"
"ENST00000449005"
"ENST00000416570"
"ENST00000623070"
"ENST00000609009"
"ENST00000622921"

But I would like a file in the following format:

rs142849724;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"|"ENST00000448975"|"ENST00000610067"|"ENST00000608189"|"ENST00000609139"|"ENST00000449005"|"ENST00000416570"|"ENST00000623070"|"ENST00000609009"|"ENST00000622921"

rs141989890;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"|"ENST00000448975"|"ENST00000610067"|"ENST00000608189"|"ENST00000609139"|"ENST00000449005"|"ENST00000416570"|"ENST00000623070"|"ENST00000609009"|"ENST00000622921"

How can I do this?

Thanks

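edit: I think I finally understand how to format my post. Thank you! What I actually want is to reorganize rsID_out.txt into one line per rsID. Sorry for any trouble caused by the poor formatting of my post. The rsID.txt file has the header line rsID as its first line and contains no blank lines. I have taken note of your answers, comments, and suggestions, and I hope to get back to you on them.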

3 Answers:

Answer 0: (score: 1)

Assume there are two data files:

  • rsID.txt contains the rsIDs to look up:
rs142849724
rs141989890
  • rsID_out.txt contains:
"1","rs142849724","ENSG00000228794","ENST00000624927" 
"2","rs142849724","ENSG00000228794","ENST00000623808" 
"3","rs142849724","ENSG00000228794","ENST00000445118" 
"4","rs142849724","ENSG00000228794","ENST00000448975" 
"5","rs142849724","ENSG00000228794","ENST00000610067" 
"6","rs142849724","ENSG00000228794","ENST00000608189" 
"7","rs142849724","ENSG00000228794","ENST00000609139" 
"8","rs142849724","ENSG00000228794","ENST00000449005" 
"9","rs142849724","ENSG00000228794","ENST00000416570" 
"10","rs142849724","ENSG00000228794","ENST00000623070" 
"11","rs142849724","ENSG00000228794","ENST00000609009" 
"12","rs142849724","ENSG00000228794","ENST00000622921" 
"13","rs141989890","ENSG00000228794","ENST00000624927" 
"14","rs141989890","ENSG00000228794","ENST00000623808" 
"15","rs141989890","ENSG00000228794","ENST00000445118" 
"16","rs141989890","ENSG00000228794","ENST00000448975" 
"17","rs141989890","ENSG00000228794","ENST00000610067" 
"18","rs141989890","ENSG00000228794","ENST00000608189" 
"19","rs141989890","ENSG00000228794","ENST00000609139" 
"20","rs141989890","ENSG00000228794","ENST00000449005" 
"21","rs141989890","ENSG00000228794","ENST00000416570" 
"22","rs141989890","ENSG00000228794","ENST00000623070" 
"23","rs141989890","ENSG00000228794","ENST00000609009"
"24","rs141989890","ENSG00000228794","ENST00000622921"

Then use awk to produce the requested output:

awk -F, '
    NR==FNR {
        x[$1]++
        next
    }
    {
        gsub(/"/, "", $2)
        k = $2 ";" $3
    }
    $2 in x { a[k] = a[k] "|" $4 }
    END {
        for (k in a) {
            sub(/[|]/, "", a[k])
            print k ";" a[k]
        }
    }
' rsID.txt rsID_out.txt
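  • NR==FNR {...} - read the list of rsIDs to look for
  • gsub - remove the double quotes from the rsID field
  • k - the key (rsID;"gene name")
  • $2 in x - only process rsIDs that are in the list
  • END - remove the leading pipe, then print each key with its value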
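One detail worth knowing about the script above: awk does not specify the iteration order of for (k in a), so the output lines may not follow the order of the input. The following is a minimal sketch of an order-preserving variant, not a replacement for the answer's code. The array names want and order, the counter n, and the trimmed four-row sample are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical demo data in a temporary directory (trimmed from the question).
tmp=$(mktemp -d)
printf '%s\n' rsID rs142849724 rs141989890 > "$tmp/rsID.txt"
cat > "$tmp/rsID_out.txt" <<'EOF'
"1","rs142849724","ENSG00000228794","ENST00000624927"
"2","rs142849724","ENSG00000228794","ENST00000623808"
"13","rs141989890","ENSG00000228794","ENST00000624927"
"14","rs141989890","ENSG00000228794","ENST00000623808"
EOF

result=$(awk -F, '
    NR==FNR { want[$1]; next }        # first file: remember the wanted rsIDs
    {
        gsub(/"/, "", $2)             # strip the quotes from the rsID field
        if (!($2 in want)) next       # skip rsIDs that were not requested
        k = $2 ";" $3                 # key: rsID;"gene name"
        if (k in a) {
            a[k] = a[k] "|" $4        # later rows: append the transcript
        } else {
            order[++n] = k            # first row for this key: record its position
            a[k] = $4
        }
    }
    END { for (i = 1; i <= n; i++) print order[i] ";" a[order[i]] }
' "$tmp/rsID.txt" "$tmp/rsID_out.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because each value starts with the first transcript rather than with a pipe, the END block no longer needs the sub() call. The header line rsID in rsID.txt ends up in want as well, which is harmless because no data row carries that value.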

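Note: this code does not require the lines to be grouped; they may appear in any order. The memory awk uses will be roughly proportional to the size of rsID_out.txt, which can become a problem if that file is very large. The awk solutions chosen by Dudi Boy, Ed Morton, and others do assume that the lines are grouped (a reasonable assumption given the sample data provided). That lets them get by with very little memory.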
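If the lines are not grouped but the file is too large for the in-memory approach, one option is to sort on the rsID field first and then stream the result. The sketch below assumes a sort that supports -s (stable sort, available in GNU and BSD sort but not required by POSIX) and uses a deliberately interleaved four-row sample; the variable names grouped, id, and rec are illustrative only.

```shell
#!/bin/sh
# Hypothetical demo data: rows for the two rsIDs are interleaved on purpose.
tmp=$(mktemp -d)
cat > "$tmp/rsID_out.txt" <<'EOF'
"1","rs142849724","ENSG00000228794","ENST00000624927"
"13","rs141989890","ENSG00000228794","ENST00000624927"
"2","rs142849724","ENSG00000228794","ENST00000623808"
"14","rs141989890","ENSG00000228794","ENST00000623808"
EOF

# Group rows by field 2; -s keeps the original order inside each group.
grouped=$(LC_ALL=C sort -t, -k2,2 -s "$tmp/rsID_out.txt" | awk -F, '
    $2 != prev {                      # a new rsID starts here
        if (NR > 1) print rec         # flush the previous group
        prev = $2
        id = $2
        gsub(/"/, "", id)             # unquoted copy of the rsID
        rec = id ";" $3 ";" $4
        next
    }
    { rec = rec "|" $4 }              # same rsID: append the transcript
    END { if (NR) print rec }         # flush the last group, if any
')

printf '%s\n' "$grouped"
rm -rf "$tmp"
```

The output is ordered by rsID rather than by first appearance, so this only fits when the output order does not matter.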


As suggested in the comments, you could also fix your own code with the help of sed. Like this:

while read line; do
    res=$( grep "$line" rsID_out.txt | awk -F , '!seen[$3]++ {print $3}' )
    ra=$( grep "$line" rsID_out.txt | awk -F , '{printf "|%s", $4} END {print ""}' | sed 's/[|]//' )
    echo "$line;$res;$ra"
done < rsID.txt

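This is somewhat less efficient: for every input line, grep and awk are each invoked twice and sed once, instead of a single awk invocation for the whole job. With a large amount of data, that can matter.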

Answer 1: (score: 0)

It sounds like this might be all you need:

$ cat file
"1","rs142849724","ENSG00000228794","ENST00000624927"
"2","rs142849724","ENSG00000228794","ENST00000623808"
"3","rs142849724","ENSG00000228794","ENST00000445118"
"13","rs141989890","ENSG00000228794","ENST00000624927"
"14","rs141989890","ENSG00000228794","ENST00000623808"
"15","rs141989890","ENSG00000228794","ENST00000445118"

$ cat tst.awk
BEGIN { FS=","; OFS="|" }
$2 != prev {
    if ( NR > 1 ) {
        print rec
    }
    prev = $2
    gsub(/"/,"",$2)
    rec = $2 ";" $3 ";" $4
    next
}
{ rec = rec OFS $4 }
END { print rec }

$ awk -f tst.awk file
rs142849724;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"
rs141989890;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"

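If this is not what you need, please update your question to clarify your requirements and provide more truly representative sample input/output.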

Answer 2: (score: 0)

I suggest running an awk script on rsID_out.txt that produces the data in the required format.

script.awk

!seen[$2""$3] {         # if new sequence of input lines
    seen[$2""$3] = 1;   # mark the new sequence
    if (rowCount++) print row; # if not first output row, print previous output row
    gsub("\"","",$2);   # clear redundant quote marks from 2nd field in input line
    row = $2";"$3";"$4; # start a new output row from the 2nd, 3rd and 4th input fields
    next;               # proceed to next input line
}
{ row = row"|"$4;}      # add 4th field from input line to output row
END { print row; }      # print the last output row.

Running the script:

 awk -F "," -f script.awk rsID_out.txt

Output:

rs142849724;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"|"ENST00000448975"|"ENST00000610067"|"ENST00000608189"|"ENST00000609139"|"ENST00000449005"|"ENST00000416570"|"ENST00000623070"|"ENST00000609009"|"ENST00000622921"
rs141989890;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"|"ENST00000448975"|"ENST00000610067"|"ENST00000608189"|"ENST00000609139"|"ENST00000449005"|"ENST00000416570"|"ENST00000623070"|"ENST00000609009"|"ENST00000622921"

Please comment on the output format and logic.

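Note that the first two delimiters on each output line (after the rsID and after the gene name) are ;, while the transcript names that follow are delimited by |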
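To illustrate the two delimiter levels, here is a small sketch of how a consumer might split one output line back into its parts. The sample line is shortened to three transcripts, and the variable names rsid, gene, transcripts, and count are made up for this example.

```shell
#!/bin/sh
# One output line, shortened to three transcripts for the demo.
line='rs142849724;"ENSG00000228794";"ENST00000624927"|"ENST00000623808"|"ENST00000445118"'

# First level: split on ';' into rsID, gene name, and the transcript list.
IFS=';' read -r rsid gene transcripts <<EOF
$line
EOF

# Second level: split the transcript list on '|' and count the entries.
count=$(printf '%s\n' "$transcripts" | tr '|' '\n' | wc -l | tr -d ' ')

printf 'rsID=%s gene=%s transcripts=%s\n' "$rsid" "$gene" "$count"
```

For the sample line this prints rsID=rs142849724 gene="ENSG00000228794" transcripts=3.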