Removing SNPs with wrong alleles

Time: 2016-06-28 10:59:02

Tags: python bash unix extract rows

I have two files that look like this:

  1. Reference panel (ReferencePanel.csv)

    "id","position","allele0","allele1","allele1_frequency"
    "seq-rs1010355",55102179,"T","C",0.098
    "seq-rs272408",55103603,"C","T",0.787
    "seq-rs11669899",55104559,"A","T",0.029
    "imm_19_59798585",55106773,"A","G",0.499
    
  2. BIM file (myfile.bim)

    19    19:55102179    0    55102179    C    T
    19    19:55103603    0    55103603    C    T
    19    19:55104559    0    55104559    G    C
    19    19:55106773    0    55106773    A    T
    
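  3. I want to remove every row of the BIM file whose two alleles differ from those in the reference panel. In other words, I only want to keep rows that have exactly the same alleles as the reference panel; the order of the alleles does not matter.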

    Example

    Reference alleles:

    "seq-rs1010355",55102179,"T","C",0.098
    "seq-rs272408",55103603,"C","T",0.787
    "seq-rs11669899",55104559,"A","T",0.029
    "imm_19_59798585",55106773,"A","G",0.499
    

    BIM file (myfile.bim)

    19    19:55102179 0   55102179    C   T
    19    19:55103603 0   55103603    C   T
    19    19:55104559 0   55104559    G   C
    19    19:55106773 0   55106773    A   T
    

    Only the following rows should be kept:

    19    19:55102179 0   55102179    C   T
    19    19:55103603 0   55103603    C   T
    

    I managed to extract all positions from the reference panel with these lines:

    import csv

    # Collect the position column (second field) of every reference SNP
    positions = []
    with open("ReferencePanel.csv", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        for row in reader:
            positions.append(row[1])
    

    But I am stuck here. I hope someone can help me. Thanks in advance!

1 answer:

Answer 0 (score: 1)

If you are not opposed to using awk, you can use the following command:

    awk -F'[",]*' 'NR==FNR && $4 && $5 {ref[$4][$5]=1} NR>FNR {FS=" *"} NR>FNR && ref[$6][$7]' reference.csv myfile.bim

This results in:

    19    19:55102179    0    55102179    C    T
    19    19:55103603    0    55103603    C    T
    19    19:55106773    0    55106773    A    T

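Note that the last row is kept because its alleles (A,T) match line 4 of the reference file; the command compares allele pairs only, not positions.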
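The command above looks up allele pairs without regard to position, which is why a row can match the alleles of a different SNP. If each row should instead be compared against the reference SNP at the same position, ignoring allele order as the question asks, a minimal Python sketch might look like this. It assumes the layouts shown in the question (position in the CSV `position` column and in the fourth BIM column, alleles in the last two BIM columns); the function names `load_reference` and `filter_bim` are made up for illustration.

```python
import csv


def load_reference(csv_lines):
    """Map each position to the unordered pair of reference alleles."""
    reader = csv.DictReader(csv_lines)
    return {
        row["position"]: frozenset((row["allele0"], row["allele1"]))
        for row in reader
    }


def filter_bim(bim_lines, reference):
    """Keep BIM rows whose two alleles equal the reference pair at that position."""
    kept = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        position, alleles = fields[3], frozenset(fields[4:6])
        if reference.get(position) == alleles:
            kept.append(line.rstrip("\n"))
    return kept


# Sample rows copied from the question
reference_csv = [
    '"id","position","allele0","allele1","allele1_frequency"',
    '"seq-rs1010355",55102179,"T","C",0.098',
    '"seq-rs272408",55103603,"C","T",0.787',
    '"seq-rs11669899",55104559,"A","T",0.029',
    '"imm_19_59798585",55106773,"A","G",0.499',
]
bim = [
    "19    19:55102179    0    55102179    C    T",
    "19    19:55103603    0    55103603    C    T",
    "19    19:55104559    0    55104559    G    C",
    "19    19:55106773    0    55106773    A    T",
]

kept_rows = filter_bim(bim, load_reference(reference_csv))
print("\n".join(kept_rows))
```

On the sample data this keeps only the rows at positions 55102179 and 55103603. For real files, both functions accept open file objects in place of the lists, for example `open("ReferencePanel.csv", newline="")` and `open("myfile.bim")`.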

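Explanation of the awk command:

-F'[",]*' sets a field separator that matches the CSV delimiters (quotes and commas); it is used to parse the reference file.

NR==FNR && $4 && $5 {ref[$4][$5]=1} runs on the reference file and records each allele pair (the C, T, G and A values) in a two-level array. The ref[$4][$5] syntax is an array of arrays, which requires GNU awk 4.0 or later.

NR>FNR {FS=" *"} switches the awk field separator to spaces so that the second file can be parsed.

NR>FNR && ref[$6][$7] prints a row of the second file when its 6th and 7th fields match an allele pair stored in the array.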
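To make these steps concrete, here is a rough Python rendering of the same logic; it is an illustration, not a replacement for the command. It assumes the sample rows from the question and uses a hypothetical helper name, `matching_rows`. Like the awk version, it stores allele pairs in the order they appear in the reference file and never looks at positions.

```python
import csv


def matching_rows(reference_lines, bim_lines):
    """Keep BIM rows whose ordered allele pair appears on any reference row."""
    # First pass: remember every ordered allele pair (ref[$4][$5]=1 in awk).
    # The header row is stored too, which is harmless for the lookup.
    pairs = set()
    for row in csv.reader(reference_lines):
        if len(row) >= 4 and row[2] and row[3]:
            pairs.add((row[2], row[3]))
    # Second pass: split each BIM row on whitespace and test its last two columns.
    kept = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6 and (fields[4], fields[5]) in pairs:
            kept.append(line.rstrip("\n"))
    return kept


# Sample rows copied from the question
reference_csv = [
    '"id","position","allele0","allele1","allele1_frequency"',
    '"seq-rs1010355",55102179,"T","C",0.098',
    '"seq-rs272408",55103603,"C","T",0.787',
    '"seq-rs11669899",55104559,"A","T",0.029',
    '"imm_19_59798585",55106773,"A","G",0.499',
]
bim = [
    "19    19:55102179    0    55102179    C    T",
    "19    19:55103603    0    55103603    C    T",
    "19    19:55104559    0    55104559    G    C",
    "19    19:55106773    0    55106773    A    T",
]

rows = matching_rows(reference_csv, bim)
print("\n".join(rows))
```

On the sample data this keeps the rows at positions 55102179, 55103603 and 55106773, the same three rows as the awk output. The third is kept only because (A, T) occurs on another reference row.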