I have a GFF file from which I extracted only the 'attributes' column:
ID=id0;Dbxref=taxon:471472;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;serotype=L2;strain=434/Bu
ID=gene0;Dbxref=GeneID:5858769;Name=hemB;gbkey=Gene;gene=hemB;gene_biotype=protein_coding;locus_tag=CTL0001
"ID=cds0;Parent=gene0;Dbxref=Genbank:YP_001654092.1,GeneID:5858769;Name=YP_001654092.1;Note=catalyzes the formation of porphobilinogen from 5-aminolevulinate;gbkey=CDS;gene=hemB;product=delta-aminolevulinic acid dehydratase;protein_id=YP_001654092.1;transl_table=11"
ID=id1;Dbxref=GeneID:5858769;Note=PS00169 Delta-aminolevulinic acid dehydratase active site.;gbkey=misc_feature;gene=hemB;inference=protein motif:Prosite:PS00169
ID=gene1;Dbxref=GeneID:5857942;Name=nqrA;gbkey=Gene;gene=nqrA;gene_biotype=protein_coding;locus_tag=CTL0002
"ID=cds1;Parent=gene1;Dbxref=Genbank:YP_001654093.1,GeneID:5857942;Name=YP_001654093.1;Note=uses the energy from reduction of ubiquinone-1 to ubiquinol to move Na(+) ions from the cytoplasm to the periplasm;gbkey=CDS;gene=nqrA;product=Na(+)-translocating NADH-quinone reductase subunit A;protein_id=YP_001654093.1;transl_table=11"
ID=gene2;Dbxref=GeneID:5858572;Name=CTL0003;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CTL0003
"ID=cds2;Parent=gene2;Dbxref=Genbank:YP_001654094.1,GeneID:5858572;Name=YP_001654094.1;gbkey=CDS;product=hypothetical protein;protein_id=YP_001654094.1;transl_table=11"
I converted it to a CSV file so that I could work with it as a DataFrame in Python:
import pandas as pd

fn = pd.read_table("D:/J/gff.csv", sep=';',
                   names=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
df = pd.DataFrame(fn)
a b \
0 ID=id0 Dbxref=taxon:471472
1 ID=gene0 Dbxref=GeneID:5858769
2 ID=cds0;Parent=gene0;Dbxref=Genbank:YP_0016540... NaN
3 ID=id1 Dbxref=GeneID:5858769
4 ID=gene1 Dbxref=GeneID:5857942
c d \
0 Is_circular=true Name=ANONYMOUS
1 Name=hemB gbkey=Gene
2 NaN NaN
3 Note=PS00169 Delta-aminolevulinic acid dehydra... gbkey=misc_feature
4 Name=nqrA gbkey=Gene
e f g \
0 gbkey=Src genome=chromosome mol_type=genomic DNA
1 gene=hemB gene_biotype=protein_coding locus_tag=CTL0001
2 NaN NaN NaN
3 gene=hemB inference=protein motif:Prosite:PS00169 NaN
4 gene=nqrA gene_biotype=protein_coding locus_tag=CTL0002
h i
0 serotype=L2 strain=434/Bu
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
Now I want to extract the rows whose ID is exactly 'geneX' (where X can be any number). I tried
df = df[df['a'].str.contains(['ID=gene'])]
but it raised the error
TypeError: unhashable type: 'list'
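The error most likely comes from passing a list where `str.contains` expects a single string pattern. A minimal sketch of the call with a string regex follows; the small frame is a hypothetical stand-in for column 'a' above, and the anchored pattern `^ID=gene\d+$` is an assumption about what "geneX" should match.

```python
import pandas as pd

# Hypothetical stand-in for column 'a' of the frame shown above.
df = pd.DataFrame({"a": ["ID=id0", "ID=gene0", "ID=cds0;Parent=gene0", "ID=gene1", None]})

# Pass the pattern as one string (interpreted as a regex), not as a list.
# na=False treats missing values as non-matches so the mask stays boolean.
mask = df["a"].str.contains(r"^ID=gene\d+$", na=False)
genes = df[mask]
print(genes)
```

Anchoring with `^` and `$` keeps values such as `ID=cds0;Parent=gene0` out of the result, which an unanchored `gene` pattern would also match.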
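I checked, and the dtype of every column is object. I want to select the rows that match the string pattern 'ID=geneX',
so that I end up with a DataFrame like this: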
ID Name locus_tag ..
gene0 hemB CTL0001
gene1 nqrA CTL0002
..
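One possible way to reach a table of that shape is to parse each attribute line into key/value pairs instead of relying on positional columns, since the keys do not appear in the same position on every line. The sketch below assumes the lines are already in memory; the helper `parse_attributes` is a hypothetical name, and the quoted CDS line is shortened for illustration.

```python
import pandas as pd

# Two gene lines copied from the sample above, plus a shortened quoted CDS line.
lines = [
    "ID=gene0;Dbxref=GeneID:5858769;Name=hemB;gbkey=Gene;gene=hemB;gene_biotype=protein_coding;locus_tag=CTL0001",
    '"ID=cds0;Parent=gene0;gbkey=CDS;gene=hemB"',
    "ID=gene1;Dbxref=GeneID:5857942;Name=nqrA;gbkey=Gene;gene=nqrA;gene_biotype=protein_coding;locus_tag=CTL0002",
]

def parse_attributes(line):
    """Turn 'key=value;key=value' into a dict, dropping surrounding quotes."""
    fields = line.strip().strip('"').split(";")
    # Split on the first '=' only, so values containing '=' stay intact.
    return dict(field.split("=", 1) for field in fields if "=" in field)

# One row per line, one column per attribute key; absent keys become NaN.
records = pd.DataFrame([parse_attributes(line) for line in lines])

# Keep rows whose ID is exactly 'gene' followed by digits.
genes = records[records["ID"].str.fullmatch(r"gene\d+")]
result = genes[["ID", "Name", "locus_tag"]].reset_index(drop=True)
print(result)
```

Keying columns by attribute name means `Name` and `locus_tag` line up across rows even when a line has extra or missing attributes.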