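I want to extract the information in a txt file into a DataFrame that contains the following fields:
1) GENEINFO
2) ID
3) POS
4) ALT
5) CLNSIG
6) CLNDN
I wrote the following code to try to read the information from the file, but I don't know how to proceed from here. Could you give me some pointers?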
import io
import pandas as pd

def read_vcf(path):
    # Skip the '##' meta lines; the '#CHROM' line is kept as the header row.
    with open(path, 'r') as f:
        lines = [l for l in f if not l.startswith('##')]
    return pd.read_csv(
        io.StringIO(''.join(lines)),
        dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
               'QUAL': str, 'FILTER': str, 'INFO': str},
        sep='\t'
    ).rename(columns={'#CHROM': 'CHROM'})

df = read_vcf('clinvar_final.txt')
Answer 0 (score: 0)
You can read the file with
df = pd.read_csv('clinvar_final.txt', comment='#', sep='\t')
After that, the DataFrame already has the columns 2) ID, 3) POS and 4) ALT:
print(df[['ID', 'POS', 'ALT']].head())
which gives
       ID      POS ALT
0  475283  1014042   A
1  542074  1014122   T
2  183381  1014143   T
3  542075  1014179   T
4  475278  1014217   T
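One thing worth checking: with comment='#', pandas ignores any line that begins with '#', and that would include a standard VCF header line starting with '#CHROM'. The question does not show the file, so whether it still has that header is an assumption. If it does, a minimal sketch like the one below drops only the '##' meta lines and keeps the column names. The helper name read_vcf_text and the two sample records are made up for illustration (the REF and CHROM values in particular are placeholders).

```python
import io
import pandas as pd

def read_vcf_text(text):
    """Parse VCF-formatted text into a DataFrame, keeping the '#CHROM' header row."""
    # Drop only the '##' meta lines; the single-'#' line holds the column names.
    lines = [line for line in text.splitlines(keepends=True)
             if not line.startswith('##')]
    df = pd.read_csv(io.StringIO(''.join(lines)), sep='\t', dtype=str)
    df = df.rename(columns={'#CHROM': 'CHROM'})
    df['POS'] = df['POS'].astype(int)
    return df

# Hypothetical sample text, for illustration only.
sample = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1014042\t475283\tG\tA\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Benign\n"
    "1\t1014143\t183381\tC\tT\t.\t.\tGENEINFO=ISG15:9636;CLNSIG=Pathogenic\n"
)
vcf_df = read_vcf_text(sample)
print(vcf_df[['ID', 'POS', 'ALT']])
```

For a real file you would pass the file's contents, for example open('clinvar_final.txt').read(), instead of the sample string.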
The remaining fields (1) GENEINFO, 5) CLNSIG and 6) CLNDN) are stored together as a single string in the INFO column. You can pull them out with a regex:
df['GENEINFO'] = df['INFO'].str.extract(r'GENEINFO=([^;]*)', expand=False)
df['CLNSIG'] = df['INFO'].str.extract(r'CLNSIG=([^;]*)', expand=False)
df['CLNDN'] = df['INFO'].str.extract(r'CLNDN=([^;]*)', expand=False)
print(df['GENEINFO'].head())
print(df['CLNSIG'].head())
print(df['CLNDN'].head())
Result:
0 ISG15:9636
1 ISG15:9636
2 ISG15:9636
3 ISG15:9636
4 ISG15:9636
Name: GENEINFO, dtype: object
0 Benign
1 Uncertain_significance
2 Pathogenic
3 Uncertain_significance
4 Benign
Name: CLNSIG, dtype: object
0 Immunodeficiency_38_with_basal_ganglia_calcifi...
1 Immunodeficiency_38_with_basal_ganglia_calcifi...
2 Immunodeficiency_38_with_basal_ganglia_calcifi...
3 Immunodeficiency_38_with_basal_ganglia_calcifi...
4 Immunodeficiency_38_with_basal_ganglia_calcifi...
Name: CLNDN, dtype: object
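If you need more INFO keys later, the same regex idea can be wrapped in a small helper. This is only a sketch: extract_info_fields is a hypothetical name, the demo rows are invented (ALLELEID=1 and Example_condition are placeholders), and the (?:^|;) prefix is my own addition so that a key only matches at the start of a field rather than as the tail of a longer key name. A key that is missing from a row comes back as NaN.

```python
import re
import pandas as pd

def extract_info_fields(df, keys, info_col='INFO'):
    """Return a copy of df with one new column per requested INFO key."""
    out = df.copy()
    for key in keys:
        # Match 'key=value' only at the start of the string or right after ';'.
        pattern = rf'(?:^|;){re.escape(key)}=([^;]*)'
        out[key] = out[info_col].str.extract(pattern, expand=False)
    return out

# Hypothetical demo rows, for illustration only.
demo = pd.DataFrame({
    'ID': ['475283', '542074'],
    'INFO': [
        'ALLELEID=1;GENEINFO=ISG15:9636;CLNSIG=Benign;CLNDN=Example_condition',
        'GENEINFO=ISG15:9636;CLNSIG=Uncertain_significance',
    ],
})
result = extract_info_fields(demo, ['GENEINFO', 'CLNSIG', 'CLNDN'])
print(result[['GENEINFO', 'ID', 'CLNSIG', 'CLNDN']])
```

Selecting the columns at the end, as in the last line, gives a frame with only the fields asked for in the question (add POS and ALT when working with the full data).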