单元格值到熊猫中的列名

时间:2020-03-28 09:51:57

标签: python pandas dataframe gff

我有以下熊猫数据框(这是一个gff文件):

df = pd.DataFrame.from_dict({'scaffold name': {0: 'Tname16C00001.1',
  1: 'Tname16C00001.1',
  2: 'Tname16C00001.1',
  3: 'Tname16C00001.1',
  4: 'Tname16C00001.1',
  5: 'Tname16C00001.1',
  6: 'Tname16C00001.1',
  7: 'Tname16C00001.1',
  8: 'Tname16C00001.1',
  9: 'Tname16C00001.1'},
 'source': {0: 'annotation',
  1: 'feature',
  2: 'feature',
  3: 'feature',
  4: 'feature',
  5: 'feature',
  6: 'feature',
  7: 'feature',
  8: 'feature',
  9: 'feature'},
 'type': {0: 'remark',
  1: 'region',
  2: 'gene',
  3: 'CDS',
  4: 'gene',
  5: 'CDS',
  6: 'gene',
  7: 'CDS',
  8: 'gene',
  9: 'CDS'},
 'start': {0: 1,
  1: 1,
  2: 943,
  3: 943,
  4: 1964,
  5: 1964,
  6: 2386,
  7: 2386,
  8: 3998,
  9: 3998},
 'stop': {0: 12018,
  1: 12018,
  2: 1992,
  3: 1992,
  4: 2179,
  5: 2179,
  6: 3897,
  7: 3897,
  8: 5152,
  9: 5152},
 'score': {0: '.',
  1: '.',
  2: '.',
  3: '.',
  4: '.',
  5: '.',
  6: '.',
  7: '.',
  8: '.',
  9: '.'},
 'strand': {0: '.',
  1: '+',
  2: '+',
  3: '+',
  4: '-',
  5: '-',
  6: '+',
  7: '+',
  8: '+',
  9: '+'},
 'phase': {0: '.',
  1: '.',
  2: '.',
  3: '0',
  4: '.',
  5: '0',
  6: '.',
  7: '0',
  8: '.',
  9: '0'},
 'attr': {0: 'accession=Tname16C00001.1;comment=Annotations were generated from the MicroScope annotation platform. Additional results are available at http://www.genoscope.cns.fr/agc/microscope . This file is not suitable for direct databank submission. To contact us: mage%40genoscope.cns.fr .%0AMicroscope genomic region coordinates: 1..12018;data_file_division=BCT;date=05-NOV-2019;organism=Genus Species Strain;source=Genus Species Strain;topology=linear',
  1: 'Is_circular=false;Note=whole genome shotgun linear WGS contig 1;db_xref=taxon:1907535,MaGe/Organism_id:12155,MaGe/Species_code:Tname16,MaGe/Sequence_id:16744,MaGe/Scaffold_id:1,MaGe/Contig_id:1,MaGe/Contig_label:NNNNNODE_1_length_11870_cov_199.017943;mol_type=genomic DNA;organism=Candidatus Thiosymbion hypermnestrae;strain=Strain',
  2: 'locus_tag=Tname16_v1_10001',
  3: 'ID=71429338;db_xref=MaGe:71429338;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10001;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=MG',
  4: 'locus_tag=Tname16_v1_10002',
  5: 'ID=71429339;db_xref=MaGe:71429339;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10002;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=MI',
  6: 'gene=wcaJ;locus_tag=Tname16_v1_10003',
  7: 'ID=71429340;db_xref=MaGe:71429340;ec_number=2.7.8.31;gene=wcaJ;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10003;product=UDP-glucose:undecaprenyl-phosphate glucose-1-phosphate transferase;transl_table=11;translation=MY',
  8: 'gene=rffE;locus_tag=Tname16_v1_10004',
  9: 'ID=71429341;db_xref=MaGe:71429341;ec_number=5.1.3.14;function=1.6.4 : Enterobacterial common antigen %28surface glycolipid%29,6.1 : Membrane,6.3 : Surface antigens %28ECA%2C O antigen of LPS%29,7.1 : Cytoplasm;gene=rffE;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10004;note=Evidence 2a : Function from experimental evidences in other organisms%3B PubMedId 11106477%2C 7559340%2C 8170390%2C 8226648%3B Product type e : enzyme;product=UDP-N-acetyl glucosamine-2-epimerase;transl_table=11;translation=MT'}})

attr列中的值实际上是其他列,但文件格式gff不允许。我想将此列中的文本拆分为多个列。这些值从广义上讲是字典,这意味着每个键都有一个用=(例如accession=Tname16C00001.1)隔开的值,并且每个键值对都用;隔开。

我首先将每个键值对分为df行的两列:

s = df['attr'].str.split(';').apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'attr'
del df['attr']
df.join(s)
df.join(s.apply(lambda x: pd.Series(x.split('='))))

这为我提供了以下df,具有重复的行索引:

    scaffold name   source  type    start   stop    score   strand  phase   0   1
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   accession   Tname16C00001.1
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   comment Annotations were generated from the MicroScope...
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   data_file_division  BCT
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   date    05-NOV-2019
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   organism    Genus Species Strain
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   source  Genus Species Strain
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   topology    linear
1   Tname16C00001.1 feature region  1   12018   .   +   .   Is_circular false
1   Tname16C00001.1 feature region  1   12018   .   +   .   Note    whole genome shotgun linear WGS contig 1
1   Tname16C00001.1 feature region  1   12018   .   +   .   db_xref taxon:1907535,MaGe/Organism_id:12155,MaGe/Spec...
1   Tname16C00001.1 feature region  1   12018   .   +   .   mol_type    genomic DNA
1   Tname16C00001.1 feature region  1   12018   .   +   .   organism    Candidatus Thiosymbion hypermnestrae
1   Tname16C00001.1 feature region  1   12018   .   +   .   strain  Strain
2   Tname16C00001.1 feature gene    943 1992    .   +   .   locus_tag   Tname16_v1_10001
...

现在,如何为每个索引分别对01列中的每个“键值”对进行转置和汇总?所有空单元格都可以有NaN(会很多)。

我想要的输出应该是:

    scaffold name   source  type    start   stop    score   strand  phase   accession   comment data_file_division  date    organism    source  topology    Is_circular Note    db_xref mol_type    organism    strain  locus_tag
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   Tname16C00001.1 Annotations were generated from the MicroScope...   BCT 05.Nov.19   Genus Species Strain    Genus Species Strain    linear  NaN NaN NaN NaN NaN NaN NaN
1   Tname16C00001.1 feature region  1   12018   .   +   .   NaN NaN NaN NaN NaN NaN NaN FALSE   whole genome shotgun linear WGS contig 1    taxon:1907535,MaGe/Organism_id:12155,MaGe/Spec...   genomic DNA Candidatus Thiosymbion hypermnestrae    Strain  NaN
2   Tname16C00001.1 feature gene    943 1992    .   +   .   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Tname16_v1_10001
...

1 个答案:

答案 0 :(得分:1)

您可以在列表理解中创建词典列表,然后创建新的DataFrame,其列名将被DataFrame.add_prefix更改,以免重复的列名(如果由DataFrame.join添加到原始列中):

L = [dict([y.split('=') for y in x.split(';')]) for x in df.pop('attr')]
df = df.join(pd.DataFrame(L, index=df.index).add_prefix('attr.'))

print (df)

     scaffold name      source    type  start   stop score strand phase  \
0  Tname16C00001.1  annotation  remark      1  12018     .      .     .   
1  Tname16C00001.1     feature  region      1  12018     .      +     .   
2  Tname16C00001.1     feature    gene    943   1992     .      +     .   
3  Tname16C00001.1     feature     CDS    943   1992     .      +     0   
4  Tname16C00001.1     feature    gene   1964   2179     .      -     .   
5  Tname16C00001.1     feature     CDS   1964   2179     .      -     0   
6  Tname16C00001.1     feature    gene   2386   3897     .      +     .   
7  Tname16C00001.1     feature     CDS   2386   3897     .      +     0   
8  Tname16C00001.1     feature    gene   3998   5152     .      +     .   
9  Tname16C00001.1     feature     CDS   3998   5152     .      +     0   

    attr.accession                                       attr.comment  ...  \
0  Tname16C00001.1  Annotations were generated from the MicroScope...  ...   
1              NaN                                                NaN  ...   
2              NaN                                                NaN  ...   
3              NaN                                                NaN  ...   
4              NaN                                                NaN  ...   
5              NaN                                                NaN  ...   
6              NaN                                                NaN  ...   
7              NaN                                                NaN  ...   
8              NaN                                                NaN  ...   
9              NaN                                                NaN  ...   

     attr.locus_tag   attr.ID                    attr.inference  \
0               NaN       NaN                               NaN   
1               NaN       NaN                               NaN   
2  Tname16_v1_10001       NaN                               NaN   
3  Tname16_v1_10001  71429338  ab initio prediction:AMIGene:2.0   
4  Tname16_v1_10002       NaN                               NaN   
5  Tname16_v1_10002  71429339  ab initio prediction:AMIGene:2.0   
6  Tname16_v1_10003       NaN                               NaN   
7  Tname16_v1_10003  71429340  ab initio prediction:AMIGene:2.0   
8  Tname16_v1_10004       NaN                               NaN   
9  Tname16_v1_10004  71429341  ab initio prediction:AMIGene:2.0   

                                           attr.note  \
0                                                NaN   
1                                                NaN   
2                                                NaN   
3                      Evidence 5 : Unknown function   
4                                                NaN   
5                      Evidence 5 : Unknown function   
6                                                NaN   
7                                                NaN   
8                                                NaN   
9  Evidence 2a : Function from experimental evide...   

                                        attr.product attr.transl_table  \
0                                                NaN               NaN   
1                                                NaN               NaN   
2                                                NaN               NaN   
3                        protein of unknown function                11   
4                                                NaN               NaN   
5                        protein of unknown function                11   
6                                                NaN               NaN   
7  UDP-glucose:undecaprenyl-phosphate glucose-1-p...                11   
8                                                NaN               NaN   
9               UDP-N-acetyl glucosamine-2-epimerase                11   

  attr.translation attr.gene attr.ec_number  \
0              NaN       NaN            NaN   
1              NaN       NaN            NaN   
2              NaN       NaN            NaN   
3               MG       NaN            NaN   
4              NaN       NaN            NaN   
5               MI       NaN            NaN   
6              NaN      wcaJ            NaN   
7               MY      wcaJ       2.7.8.31   
8              NaN      rffE            NaN   
9               MT      rffE       5.1.3.14   

                                       attr.function  
0                                                NaN  
1                                                NaN  
2                                                NaN  
3                                                NaN  
4                                                NaN  
5                                                NaN  
6                                                NaN  
7                                                NaN  
8                                                NaN  
9  1.6.4 : Enterobacterial common antigen %28surf...  

[10 rows x 30 columns]