I have two files that look like this:
Reference panel (ReferencePanel.csv)
"id","position","allele0","allele1","allele1_frequency"
"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499
BIM file (myfile.bim)
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
19 19:55104559 0 55104559 G C
19 19:55106773 0 55106773 A T
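I want to remove every row of the BIM file whose two alleles differ from those in the reference panel. In other words, I only want to keep the rows that have exactly the same alleles as the reference panel; the order of the alleles does not matter.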
Example:
Reference alleles:
"seq-rs1010355",55102179,"T","C",0.098
"seq-rs272408",55103603,"C","T",0.787
"seq-rs11669899",55104559,"A","T",0.029
"imm_19_59798585",55106773,"A","G",0.499
BIM file (myfile.bim)
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
19 19:55104559 0 55104559 G C
19 19:55106773 0 55106773 A T
Only the following rows should be kept:
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
I managed to extract all positions from the reference panel with these lines:
# Collect the positions listed in the reference panel
positions = []
with open("ReferencePanel.csv") as panel:
    next(panel)  # skip the header row
    for line in panel:
        columns = line.strip().split(",")
        positions.append(columns[1])
But I am stuck here. I hope someone can help me. Thanks in advance!
Answer 0 (score: 1)
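If you are not opposed to using awk (GNU awk, since the command relies on its arrays of arrays), you can use the following command: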
awk -F'[",]*' 'NR==FNR && $4 && $5 {ref[$4][$5]=1} NR>FNR {FS=" *"} NR>FNR && ref[$6][$7]' reference.csv myfile.bim
which results in:
19 19:55102179 0 55102179 C T
19 19:55103603 0 55103603 C T
19 19:55106773 0 55106773 A T
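Note that the last line is kept because its alleles (A, T) match line 4 of the reference file. The command compares allele pairs only, not positions, so a BIM row is printed whenever its pair occurs anywhere in the reference panel.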
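To make that behavior concrete, here is a minimal Python sketch of the lookup idea only. It does not reproduce the awk field splitting, and the names `reference_pairs`, `bim_rows` and `kept_ids` are made up for illustration; the data is copied from the question.

```python
# Ordered allele pairs taken from the reference panel, with positions ignored
reference_pairs = {("T", "C"), ("C", "T"), ("A", "T"), ("A", "G")}

# (variant id, first allele, second allele) for each BIM row in the question
bim_rows = [
    ("19:55102179", "C", "T"),
    ("19:55103603", "C", "T"),
    ("19:55104559", "G", "C"),
    ("19:55106773", "A", "T"),
]

# A row is kept as soon as its pair appears in ANY reference row
kept_ids = [variant for variant, a1, a2 in bim_rows if (a1, a2) in reference_pairs]
print(kept_ids)  # ['19:55102179', '19:55103603', '19:55106773']
```

The third kept row matches the (A, T) pair of seq-rs11669899, which sits at a different position (55104559).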
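Explanation:
-F'[",]*'
matches the CSV delimiters and is used to parse the reference file
NR==FNR && $4 && $5 {ref[$4][$5]=1}
stores every allele pair (C, T, G, A) from the reference file in the array ref
NR>FNR {FS=" *"}
changes the awk field separator to spaces in order to parse the second file
NR>FNR && ref[$6][$7]
prints a line of the second file if its columns 6 and 7 match a pair stored in the array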
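If the comparison should also take the position into account, as the expected output in the question suggests, a Python version along these lines may be closer to what is wanted. This is a sketch under assumptions, not a tested pipeline: it assumes the fourth BIM column holds the position and the fifth and sixth hold the alleles, that each position occurs only once in the panel, and the function names `load_reference` and `filter_bim` are hypothetical. The sample data is copied from the question.

```python
import csv
import io


def load_reference(csv_file):
    """Map each position to its unordered allele pair from the reference panel."""
    reference = {}
    for row in csv.DictReader(csv_file):
        # Assumption: one row per position; a repeated position overwrites the earlier one
        reference[row["position"]] = frozenset((row["allele0"], row["allele1"]))
    return reference


def filter_bim(bim_lines, reference):
    """Keep BIM lines whose position is in the panel with the same two alleles."""
    kept = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        position, alleles = fields[3], frozenset(fields[4:6])
        # frozenset comparison ignores allele order (C/T equals T/C)
        if reference.get(position) == alleles:
            kept.append(line.rstrip("\n"))
    return kept


# Sample data from the question; for real files, pass
# open("ReferencePanel.csv", newline="") and open("myfile.bim") instead.
panel = io.StringIO(
    '"id","position","allele0","allele1","allele1_frequency"\n'
    '"seq-rs1010355",55102179,"T","C",0.098\n'
    '"seq-rs272408",55103603,"C","T",0.787\n'
    '"seq-rs11669899",55104559,"A","T",0.029\n'
    '"imm_19_59798585",55106773,"A","G",0.499\n'
)
bim = io.StringIO(
    "19 19:55102179 0 55102179 C T\n"
    "19 19:55103603 0 55103603 C T\n"
    "19 19:55104559 0 55104559 G C\n"
    "19 19:55106773 0 55106773 A T\n"
)

kept_lines = filter_bim(bim, load_reference(panel))
print("\n".join(kept_lines))
```

On the sample data this keeps only the rows at 55102179 and 55103603, because the row at 55106773 has A/T while the panel lists A/G at that position.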