I have a dictionary with the following structure:
{'OPPHJFPK_00001': ['K00879', 'PF00370.22'],
'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304'],
'OPPHJFPK_00003': ['', 'COG3279', 'GH65'],
'OPPHJFPK_00004': ['', 'PF13460.7', 'COG0451'],
'OPPHJFPK_00005': ['']}
My goal is to get a DataFrame in which each feature (which always starts with K, P, C or G) sits in the correct column:
| OPPHJFPK_00001 | K00879 | PF00370.22 | | |
| OPPHJFPK_00002 | | PF01070.19 | COG1304 | |
| OPPHJFPK_00003 | | | COG3279 | GH65 |
| OPPHJFPK_00004 | | PF13460.7 | | |
| OPPHJFPK_00005 | | | | GTA |
I have tried:
df = pd.DataFrame.from_dict(d, orient='index')
But what I get is misaligned:
| OPPHJFPK_00001 | K00879 | PF00370.22 | |
| OPPHJFPK_00002 | | PF01070.19 | COG1304 |
| OPPHJFPK_00003 | | COG3279 | GH65 |
| OPPHJFPK_00004 | | PF13460.7 | |
| OPPHJFPK_00005 | | GTA | |
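Is there a pandas function that can solve this?
Note that the first column is always correct because, when that feature is missing, the dictionary holds an empty string in its place. For the remaining features, if one is absent, the dictionary contains nothing for it.
Any ideas on how to solve this? I would appreciate it.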
Answer 0 (score: 1)
Assuming d is your dict:
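import pandas as pd

# One row per (key, feature) pair
s = pd.Series(d).explode()
# Drop the empty-string placeholders
s = s[s != '']
# Pivot on the first letter of each feature
df = pd.crosstab(index=s.index, columns=s.str[0], values=s, aggfunc='first')
df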
col_0 C G K P
row_0
OPPHJFPK_00001 NaN NaN K00879 PF00370.22
OPPHJFPK_00002 COG1304 NaN NaN PF01070.19
OPPHJFPK_00003 COG3279 GH65 NaN NaN
OPPHJFPK_00004 COG0451 NaN NaN PF13460.7
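Note that the output above has no row for OPPHJFPK_00005 (it contains only an empty string, so it is filtered out) and that the columns come out in alphabetical order. A minimal sketch of one way to adjust this, assuming the desired column order is K, P, C, G as listed in the question, is to reindex the result:

```python
import pandas as pd

d = {'OPPHJFPK_00001': ['K00879', 'PF00370.22'],
     'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304'],
     'OPPHJFPK_00003': ['', 'COG3279', 'GH65'],
     'OPPHJFPK_00004': ['', 'PF13460.7', 'COG0451'],
     'OPPHJFPK_00005': ['']}

s = pd.Series(d).explode()
s = s[s != '']
df = pd.crosstab(index=s.index, columns=s.str[0], values=s, aggfunc='first')

# Assumption: K, P, C, G is the wanted column order.
# Reindexing on the original keys brings back rows that had no features at all;
# they are filled with NaN.
df = df.reindex(index=list(d), columns=list('KPCG'))
print(df)
```

If empty strings are preferred over NaN, `df.fillna('')` can be applied afterwards.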
Answer 1 (score: 0)
Try this:
data = {'OPPHJFPK_00001': ['K00879', 'PF00370.22',''],
'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304'],
'OPPHJFPK_00003': ['', 'COG3279', 'GH65'],
'OPPHJFPK_00004': ['', 'PF13460.7', 'COG0451'],
'OPPHJFPK_00005': ['','','']}
pd.DataFrame.from_dict(data)
Then you can use DataFrame.transpose().
Answer 2 (score: 0)
Another solution is to reshape the dictionary:
a = {'OPPHJFPK_00001': ['K00879', 'PF00370.22'],
'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304'],
'OPPHJFPK_00003': ['', 'COG3279', 'GH65'],
'OPPHJFPK_00004': ['', 'PF13460.7', 'COG0451'],
'OPPHJFPK_00005': ['']}
# Reshape it so that each value is a dict of {letter: value}
a = {k: {x[0]: x for x in v if x} for k, v in a.items()}
# And then take care of those empty values
a = {k: v if v else {'K': float('nan')} for k, v in a.items()}
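The answer stops before building the DataFrame. One possible way to finish, sketched here under the assumption that the columns should appear in the order K, P, C, G, is to pass the reshaped dict to `pd.DataFrame.from_dict` and reindex so that the row and column order is explicit:

```python
import pandas as pd

a = {'OPPHJFPK_00001': ['K00879', 'PF00370.22'],
     'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304'],
     'OPPHJFPK_00003': ['', 'COG3279', 'GH65'],
     'OPPHJFPK_00004': ['', 'PF13460.7', 'COG0451'],
     'OPPHJFPK_00005': ['']}

# Same two steps as above: {letter: value} per key, NaN placeholder for empty keys
a = {k: {x[0]: x for x in v if x} for k, v in a.items()}
a = {k: v if v else {'K': float('nan')} for k, v in a.items()}

# Dict of dicts -> one row per key, one column per first letter.
# Assumption: K, P, C, G is the wanted column order; reindex pins it down
# together with the original key order.
df = pd.DataFrame.from_dict(a, orient='index').reindex(index=list(a), columns=list('KPCG'))
print(df)
```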
Answer 3 (score: 0)
To get the expected output, the dictionary must have the following format:
d = {'OPPHJFPK_00001': ['K00879', 'PF00370.22', '', ''],
'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304', ''],
'OPPHJFPK_00003': ['', '', 'COG3279', 'GH65'],
'OPPHJFPK_00004': ['', 'PF13460.7', '', ''],
'OPPHJFPK_00005': ['','','', 'GTA']}
df = pd.DataFrame.from_dict(d, orient='index')
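You are getting the misaligned format because your lists have different lengths, and from_dict fills the columns by position.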
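Padding the lists by hand is tedious for a large dictionary. A minimal sketch of how it could be automated, assuming each non-empty entry starts with one of K, P, C or G and that this is also the column order (`pad_by_prefix` is a hypothetical helper name, not a pandas function):

```python
# Assumption: column order taken from the question (K, P, C, G).
ORDER = "KPCG"

def pad_by_prefix(d, order=ORDER):
    """Return a new dict whose lists have exactly one slot per prefix letter."""
    padded = {}
    for key, values in d.items():
        # Map first letter -> entry, skipping empty strings.
        # If two entries share a letter, the later one wins.
        by_letter = {v[0]: v for v in values if v}
        padded[key] = [by_letter.get(letter, "") for letter in order]
    return padded

d = {'OPPHJFPK_00001': ['K00879', 'PF00370.22'],
     'OPPHJFPK_00002': ['', 'PF01070.19', 'COG1304'],
     'OPPHJFPK_00003': ['', 'COG3279', 'GH65'],
     'OPPHJFPK_00004': ['', 'PF13460.7', 'COG0451'],
     'OPPHJFPK_00005': ['']}

padded = pad_by_prefix(d)
for key, row in padded.items():
    print(key, row)
```

The padded dict can then be passed to `pd.DataFrame.from_dict(padded, orient='index')` as shown above, since every list now has the same length.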