How to keep rows matching a string in a .gff file in Python

Asked: 2018-01-08 22:45:48

Tags: python pandas dataframe split

I have a GFF file from which I extracted only the 'attributes' information:

ID=id0;Dbxref=taxon:471472;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;serotype=L2;strain=434/Bu
ID=gene0;Dbxref=GeneID:5858769;Name=hemB;gbkey=Gene;gene=hemB;gene_biotype=protein_coding;locus_tag=CTL0001
"ID=cds0;Parent=gene0;Dbxref=Genbank:YP_001654092.1,GeneID:5858769;Name=YP_001654092.1;Note=catalyzes the formation of porphobilinogen from 5-aminolevulinate;gbkey=CDS;gene=hemB;product=delta-aminolevulinic acid dehydratase;protein_id=YP_001654092.1;transl_table=11"
ID=id1;Dbxref=GeneID:5858769;Note=PS00169 Delta-aminolevulinic acid dehydratase active site.;gbkey=misc_feature;gene=hemB;inference=protein motif:Prosite:PS00169
ID=gene1;Dbxref=GeneID:5857942;Name=nqrA;gbkey=Gene;gene=nqrA;gene_biotype=protein_coding;locus_tag=CTL0002
"ID=cds1;Parent=gene1;Dbxref=Genbank:YP_001654093.1,GeneID:5857942;Name=YP_001654093.1;Note=uses the energy from reduction of ubiquinone-1 to ubiquinol to move Na(+) ions from the cytoplasm to the periplasm;gbkey=CDS;gene=nqrA;product=Na(+)-translocating NADH-quinone reductase subunit A;protein_id=YP_001654093.1;transl_table=11"
ID=gene2;Dbxref=GeneID:5858572;Name=CTL0003;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CTL0003
"ID=cds2;Parent=gene2;Dbxref=Genbank:YP_001654094.1,GeneID:5858572;Name=YP_001654094.1;gbkey=CDS;product=hypothetical protein;protein_id=YP_001654094.1;transl_table=11"

I converted it to a CSV file so that I could process it as a DataFrame in Python:

import pandas as pd

# read_table already returns a DataFrame; each ';'-separated field becomes a column
df = pd.read_table("D:/J/gff.csv", sep=';',
                   names=["a", "b", "c", "d", "e", "f", "g", "h", "i"])

                                                   a                      b  \
0                                             ID=id0    Dbxref=taxon:471472   
1                                           ID=gene0  Dbxref=GeneID:5858769   
2  ID=cds0;Parent=gene0;Dbxref=Genbank:YP_0016540...                    NaN   
3                                             ID=id1  Dbxref=GeneID:5858769   
4                                           ID=gene1  Dbxref=GeneID:5857942   

                                                   c                   d  \
0                                   Is_circular=true      Name=ANONYMOUS   
1                                          Name=hemB          gbkey=Gene   
2                                                NaN                 NaN   
3  Note=PS00169 Delta-aminolevulinic acid dehydra...  gbkey=misc_feature   
4                                          Name=nqrA          gbkey=Gene   

           e                                        f                     g  \
0  gbkey=Src                        genome=chromosome  mol_type=genomic DNA   
1  gene=hemB              gene_biotype=protein_coding     locus_tag=CTL0001   
2        NaN                                      NaN                   NaN   
3  gene=hemB  inference=protein motif:Prosite:PS00169                   NaN   
4  gene=nqrA              gene_biotype=protein_coding     locus_tag=CTL0002   

             h              i  
0  serotype=L2  strain=434/Bu  
1          NaN            NaN  
2          NaN            NaN  
3          NaN            NaN  
4          NaN            NaN

Now I want to extract only the rows whose ID has the form 'geneX' (where X can be any number). I tried

df = df[df['a'].str.contains(['ID=gene'])]

but it raises the error

TypeError: unhashable type: 'list'

I checked that the dtype of every column is object. I want to select the rows that match the string pattern 'ID=geneX'.
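A possible way around the error, as a minimal sketch: `Series.str.contains` takes a single string or regular-expression pattern, so passing the list `['ID=gene']` is the likely cause of the `TypeError`. The sketch below uses `Series.str.match` with an anchored regex, so that values which merely mention a gene elsewhere (for example in `Parent=gene0`) are not selected. The small `df` is an assumed stand-in for column `a`, with a shortened cds entry, not the real file.

```python
import pandas as pd

# Assumed stand-in for column 'a' of the DataFrame shown above
df = pd.DataFrame(
    {"a": ["ID=id0", "ID=gene0", "ID=cds0;Parent=gene0", "ID=id1", "ID=gene1"]}
)

# The pattern is a plain string (regex), not a list.
# str.match anchors at the start; '$' requires the value to end after the digits.
# na=False treats missing values as non-matches instead of propagating NaN.
genes = df[df["a"].str.match(r"ID=gene\d+$", na=False)]
print(genes)
```

If a looser substring test is enough, `df['a'].str.contains('ID=gene', na=False)` should also work, but it would additionally keep any value that contains `ID=gene` anywhere in the string.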

The goal is a DataFrame like this:

ID    Name    locus_tag   ..
gene0  hemB        CTL0001
gene1  nqrA        CTL0002
..
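One way to reach a table of this shape, sketched under assumptions: instead of splitting on ';' by position, each attribute string can be parsed into `key=value` pairs so that `ID`, `Name` and `locus_tag` become real columns. The helper `parse_attributes` is hypothetical (not part of pandas), the sample `lines` are copied from the data above, and the sketch assumes that no attribute value itself contains a ';'. Surrounding double quotes, as on the cds lines, are stripped before splitting.

```python
import pandas as pd

# Sample attribute strings copied from the data above (one of them quoted)
lines = [
    "ID=gene0;Dbxref=GeneID:5858769;Name=hemB;gbkey=Gene;gene=hemB;gene_biotype=protein_coding;locus_tag=CTL0001",
    "ID=id1;Dbxref=GeneID:5858769;Note=PS00169 Delta-aminolevulinic acid dehydratase active site.;gbkey=misc_feature;gene=hemB;inference=protein motif:Prosite:PS00169",
    "ID=gene1;Dbxref=GeneID:5857942;Name=nqrA;gbkey=Gene;gene=nqrA;gene_biotype=protein_coding;locus_tag=CTL0002",
    '"ID=cds2;Parent=gene2;Dbxref=Genbank:YP_001654094.1,GeneID:5858572;Name=YP_001654094.1;gbkey=CDS;product=hypothetical protein;protein_id=YP_001654094.1;transl_table=11"',
]


def parse_attributes(line):
    """Hypothetical helper: turn 'k1=v1;k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    fields = line.strip().strip('"').split(";")
    # Split each field on the first '=' only, so values may contain '='
    return dict(field.split("=", 1) for field in fields if "=" in field)


# One row per line, one column per attribute key; missing keys become NaN
attrs = pd.DataFrame([parse_attributes(line) for line in lines])

# Keep only rows whose ID is 'gene' followed by digits, then pick the columns
is_gene = attrs["ID"].str.match(r"gene\d+$", na=False)
genes = attrs.loc[is_gene, ["ID", "Name", "locus_tag"]].reset_index(drop=True)
print(genes)
```

With the real file, `lines` could instead be filled by iterating over the opened file, which also avoids the quoted rows collapsing into a single column as in the `read_table` output above.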

0 answers:

No answers yet.