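I am trying to extract specific values (stored as key: value pairs) from a pandas column that holds multiple semicolon-separated pairs.
The input data frame looks like this: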
9 114188457 114192289 cast_3_930|cast_1_1069|cast_2_985 0.9510007336163186 - 114188457 114188457 211,111,111 "gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; exon_number ""23""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001401544""; exon_version ""1""; tag ""basic""; transcript_support_level ""5"";" .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""26""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001400969""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 "gene_id ""ENSMUSG00000064299""; gene_version ""6""; transcript_id ""ENSMUST00000213446""; transcript_version ""1""; exon_number ""25""; gene_name ""4921528I07Rik""; gene_source ""ensembl_havana""; gene_biotype ""processed_transcript""; havana_gene ""OTTMUSG00000062515""; havana_gene_version ""1""; transcript_name ""4921528I07Rik-202""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152299""; havana_transcript_version ""1""; exon_id ""ENSMUSE00001404576""; exon_version ""1""; tag ""basic""; transcript_support_level ""1"";" .
I am working with column 10, which looks like this:
"gene_id ""ENSMUSG00000111734""; gene_version ""1""; transcript_id ""ENSMUST00000214237""; transcript_version ""1""; gene_name ""Gm29825""; gene_source ""havana""; gene_biotype ""lincRNA""; havana_gene ""OTTMUSG00000062514""; havana_gene_version ""1""; transcript_name ""Gm29825-201""; transcript_source ""havana""; transcript_biotype ""lincRNA""; havana_transcript ""OTTMUST00000152298""; havana_transcript_version ""1""; tag ""basic""; transcript_support_level ""5"";"
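Each pair has the format: identifier ""value""
Although I could extract the values by splitting that column into another data frame and selecting the relevant entries, the problem is that the data within the column is not consistently ordered.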
In this case I am only interested in gene_id, gene_name and gene_biotype, but the set of required terms may change in the future.
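I could have used a dictionary-based solution, but the values are not unique to each row, and in some rows they are missing entirely (the rows where column 10 is just a .).
Ultimately, I would like the data frame to look like this: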
9 114188457 114192289 cast_3_930|cast_1_1069|cast_2_985 0.9510007336163186 - 114188457 114188457 211,111,111 ENSMUSG00000111734 Gm29825 lincRNA .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 ENSMUSG00000064299 4921528I07Rik processed_transcript .
9 114227850 114241851 cast_3_932|cast_1_1071|cast_2_988 1.2516483862692769 + 114227850 114227850 211,111,111 ENSMUSG00000064299 4921528I07Rik processed_transcript .
What is the most efficient way to do this in pandas?
Answer 0: (score: 1)
You can use a regular expression through the .str accessor on the column:
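# Raw strings keep the regex escapes intact; expand=False returns a Series.
# [^"]+ also matches values containing hyphens or dots, which \w+ would miss.
df['gene_id'] = df.iloc[:, 9].str.extract(r'gene_id "([^"]+)";', expand=False)
df['gene_name'] = df.iloc[:, 9].str.extract(r'gene_name "([^"]+)";', expand=False)
df['gene_biotype'] = df.iloc[:, 9].str.extract(r'gene_biotype "([^"]+)";', expand=False)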
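
Since the question says the set of required keys may change, the same idea can be wrapped in a small helper that takes the keys as a list. This is only a sketch: the function name `extract_attributes` and the toy `attrs` Series are hypothetical, and it assumes the attribute strings hold single double-quotes once pandas has parsed the file. Rows that contain only `.` simply yield NaN for every key.

```python
import re

import pandas as pd


def extract_attributes(attr_col, keys):
    """Return one column per requested key, pulled from `key "value";` pairs.

    Hypothetical helper: attr_col is a Series of attribute strings,
    keys is a list of attribute names to extract.
    """
    out = {}
    for key in keys:
        # Anchor on the start of the string or a preceding ';' so that
        # e.g. "gene_id" cannot match inside a longer key name.
        pattern = rf'(?:^|;\s*){re.escape(key)} "([^"]*)"'
        out[key] = attr_col.str.extract(pattern, expand=False)
    return pd.DataFrame(out, index=attr_col.index)


# Toy data standing in for column 10 (assumed layout, shortened).
attrs = pd.Series([
    'gene_id "ENSMUSG00000111734"; gene_version "1"; '
    'gene_name "Gm29825"; gene_biotype "lincRNA";',
    '.',  # row with no attributes at all
])

wanted = ["gene_id", "gene_name", "gene_biotype"]
result = extract_attributes(attrs, wanted)
print(result)
```

On the real data frame, the result could be joined back with something like `df.join(extract_attributes(df.iloc[:, 9], wanted))`, after which the original attribute column can be dropped if it is no longer needed.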