Parsing a txt file in VCF format

Time: 2019-12-13 04:06:57

Tags: python pandas

I want to extract the information from a txt file into a DataFrame that contains the following fields:

1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN 

The txt file is here

I wrote the following code to try to get the information out of the file, but I don't know how to proceed. Could you give me some ideas?

import io
import pandas as pd


def read_vcf(path):
    # Drop the '##' metadata lines; the column header line and the records remain.
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

1 answer:

Answer 0 (score: 0)

You can read the file with

df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')

After that, the table already has the columns 2) ID, 3) POS and 4) ALT:

print(df[['ID', 'POS', 'ALT']].head())

which gives

       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
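
One caveat, as a hedged aside: in the standard VCF layout the column header line itself begins with `#CHROM`, and pandas ignores any line that starts with the `comment` character, so on such a file `comment='#'` would discard the header along with the `##` metadata. The post does not show the header of `clinvar_final.txt`, so whether this applies is an assumption. The sketch below filters only the `##` lines, as the question's function does; `read_vcf_text` and the two sample records are illustrative, not taken from the real file.

```python
import io
import pandas as pd

# Illustrative two-record sample in standard VCF layout (not the real file).
SAMPLE_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1014042\t475283\tG\tA\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Benign\n"
    "1\t1014122\t542074\tC\tT\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Uncertain_significance\n"
)


def read_vcf_text(handle):
    """Read VCF-style text from an open handle, dropping only '##' lines."""
    lines = [line for line in handle if not line.startswith("##")]
    df = pd.read_csv(io.StringIO("".join(lines)), sep="\t", dtype={"#CHROM": str})
    # The header line survives, so the columns keep their VCF names.
    return df.rename(columns={"#CHROM": "CHROM"})


df_sample = read_vcf_text(io.StringIO(SAMPLE_VCF))
print(df_sample[["ID", "POS", "ALT"]])
```

For a file on disk, the same function can be called as `read_vcf_text(open(path))` inside a `with` block.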

The remaining fields (1) GENEINFO, 5) CLNSIG, 6) CLNDN) are stored as a single string in the INFO column. You can pull each one out with a regex:

df['GENEINFO'] = df['INFO'].str.extract('GENEINFO=([^;]*)')
df['CLNSIG'] = df['INFO'].str.extract('CLNSIG=([^;]*)')
df['CLNDN'] = df['INFO'].str.extract('CLNDN=([^;]*)')

print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())

Result:

0    ISG15:9636
1    ISG15:9636
2    ISG15:9636
3    ISG15:9636
4    ISG15:9636
Name: GENEINFO, dtype: object

0                    Benign
1    Uncertain_significance
2                Pathogenic
3    Uncertain_significance
4                    Benign
Name: CLNSIG, dtype: object

0    Immunodeficiency_38_with_basal_ganglia_calcifi...
1    Immunodeficiency_38_with_basal_ganglia_calcifi...
2    Immunodeficiency_38_with_basal_ganglia_calcifi...
3    Immunodeficiency_38_with_basal_ganglia_calcifi...
4    Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
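
As an alternative to one regex per field, the INFO string can be split into a dictionary once and the wanted keys looked up by exact name. This is only a sketch: `parse_info` is a hypothetical helper, the INFO strings are made up in the style of the output above, and the final split assumes GENEINFO holds a single `SYMBOL:GeneID` pair.

```python
import pandas as pd


def parse_info(info):
    """Turn 'KEY=value;FLAG;KEY2=value2' into a dict (bare flags map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Illustrative INFO strings (not the real file).
df_demo = pd.DataFrame({
    "INFO": [
        "ALLELEID=1;CLNDN=Some_condition;CLNSIG=Benign;GENEINFO=ISG15:9636",
        "CLNSIG=Pathogenic;GENEINFO=ISG15:9636",
    ]
})

wanted = ["GENEINFO", "CLNSIG", "CLNDN"]
parsed = df_demo["INFO"].apply(parse_info)
for key in wanted:
    # Keys missing from a record come back as None.
    df_demo[key] = parsed.apply(lambda d, k=key: d.get(k))

# Assumes one 'SYMBOL:GeneID' pair per record, as in the output shown above.
df_demo[["GENE", "GENE_ID"]] = df_demo["GENEINFO"].str.split(":", n=1, expand=True)

print(df_demo[["GENE", "GENE_ID", "CLNSIG", "CLNDN"]])
```

Parsing once keeps every INFO key available, at the cost of a Python-level loop per row; for three fields on a large file the vectorized `str.extract` calls may well be faster.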