I have this tabular file with more than 60,000 records:
head -2 hg38.txt
717 NM_000525 chr11 - 17385248 17388659 17386918 17388091 1 17385248, 17388659, 0 KCNJ11 cmpl cmpl 0,
987 NM_000242 chr10 - 52765379 52771700 52768136 52771635 4 52765379,52769246,52770669,52771448, 52768510,52769315,52770786,52771700, 0 MBL2 cmpl cmpl 1,1,1,0,
Previously, I extracted the third column from it, kept only selected rows, and saved the result in another file, chromosomes.txt:
gawk '{print $3}' hg38.txt | sort -u | grep -v "_" | sort -o chromosomes.txt
head -5 chromosomes.txt
chr1
chr10
chr11
chr12
chr13
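Now I want to select the records whose chromosome field matches an entry in chromosomes.txt, but since I also want another field in my final result, I do this: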
gawk '{print $3, $13}' hg38.txt | sort | join - chromosomes.txt > final.txt
But the join command warns:
join: -:833: is not sorted: chr10 GLRX3
How can I join them? Also, after the join, is it possible to just add a pipe (|) instead of creating a temporary file? For example:
gawk '{print $3, $13}' hg38.txt | sort | join - chromosomes.txt | gawk '{print $2}' | uniq -c | gawk 'BEGIN{t=0}{t=t+$1} END{print t/NR}'
Thanks in advance for your answers!
Answer 0 (score: 1)
Why don't you do the filtering in gawk?
gawk '{ if (!match($3,"_")) {print $3, $13} }' hg38.txt
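If you would rather keep join, a likely cause of the warning (an assumption, since the locale in use is not shown) is that sort without a key compares whole lines using the locale's collation rules, whereas join only looks at the first field. Sorting both inputs on the join field under the same locale, for example LC_ALL=C, usually makes the two agree. Below is a minimal sketch of that idea. The function names make_chrom_list and chrom_gene_pairs are hypothetical, and plain awk is used so that it runs where gawk is not installed (gawk behaves the same for these commands).

```shell
# Hypothetical helper: list the chromosomes in table $1 whose names
# contain no underscore, sorted byte-wise (C locale) and de-duplicated.
make_chrom_list() {
    awk '{print $3}' "$1" | grep -v "_" | LC_ALL=C sort -u
}

# Hypothetical helper: print "chrom gene" for every row of table $1
# whose chromosome appears in list file $2.
# Assumes $2 was sorted with LC_ALL=C, e.g. by make_chrom_list.
chrom_gene_pairs() {
    awk '{print $3, $13}' "$1" |
        LC_ALL=C sort -k1,1 |      # sort on the join field only
        LC_ALL=C join - "$2"       # same collation as the sort above
}
```

With these defined, `make_chrom_list hg38.txt > chromosomes.txt` rebuilds the list, and the output of `chrom_gene_pairs hg38.txt chromosomes.txt` can be piped straight into the rest of your pipeline (`| gawk '{print $2}' | uniq -c | ...`), so no temporary file for the joined result is needed.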