Question

I want to create a DataFrame from a dictionary of the following format:

Dictionary_ =  {'Key1': ['a', 'b', 'c', 'd'],'Key2': ['d', 'f'],'Key3': ['a', 'c', 'm', 'n']}

I am using

df = pd.DataFrame.from_dict(Dictionary_, orient ='index')

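but it creates its own columns, up to the length of the longest value list, and puts the dictionary's values into the DataFrame as cell values.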

I want a df with the keys as rows and the values as columns, like this:

      a  b  c  d  e  f  m  n
Key1  1  1  1  1  0  0  0  0
Key2  0  0  0  1  0  1  0  0
Key3  1  0  1  0  0  0  1  1

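I can do this by concatenating all the values of the dict, creating an empty DataFrame with the dict keys as rows and the values as columns, and then iterating over each row to fetch its values from the dict and put a 1 wherever they match a column. This is too slow, because my data has 200,000 rows and .loc is slow. I feel I could somehow use pandas get_dummies, but I don't know how to apply it here.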
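A rough sketch of the loop-based approach described above, for comparison only. The names `all_values` and `slow_df` are illustrative assumptions, not the original code:

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'], 'Key2': ['d', 'f'], 'Key3': ['a', 'c', 'm', 'n']}

# Collect every distinct value to use as the column labels.
all_values = sorted({value for values in Dictionary_.values() for value in values})

# Start from an all-zero frame: one row per key, one column per value.
slow_df = pd.DataFrame(0, index=list(Dictionary_), columns=all_values)

# Fill the cells one row at a time; this per-row .loc assignment is the slow part.
for key, values in Dictionary_.items():
    slow_df.loc[key, values] = 1
```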

I think there must be a smarter way.

Answer 1

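If performance is important, use MultiLabelBinarizer and pass it the keys and values:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 matrix: one row per key, one column per distinct value
df = pd.DataFrame(mlb.fit_transform(Dictionary_.values()),
                  columns=mlb.classes_,
                  index=Dictionary_.keys())
print(df)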
      a  b  c  d  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  0  0  1  1  0  0
Key3  1  0  1  0  0  1  1
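With around 200,000 rows the dense 0/1 matrix may become large. One possible variation, sketched here on the assumption that a sparse result is acceptable for the later processing, is to ask MultiLabelBinarizer for sparse output and wrap it in a sparse DataFrame. The name `sparse_df` is illustrative:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'], 'Key2': ['d', 'f'], 'Key3': ['a', 'c', 'm', 'n']}

# sparse_output=True makes fit_transform return a SciPy sparse matrix
# instead of a dense NumPy array.
mlb = MultiLabelBinarizer(sparse_output=True)
matrix = mlb.fit_transform(Dictionary_.values())

# Wrap the sparse matrix without densifying it.
sparse_df = pd.DataFrame.sparse.from_spmatrix(
    matrix, index=list(Dictionary_), columns=mlb.classes_
)
```

Calling `sparse_df.sparse.to_dense()` gives back an ordinary dense frame if one is needed later.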

An alternative, though slower, is to create a Series, join each list into a single string with str.join, and finally call str.get_dummies:

df = pd.Series(Dictionary_).str.join('|').str.get_dummies()
print (df)
      a  b  c  d  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  0  0  1  1  0  0
Key3  1  0  1  0  0  1  1
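To make the chained call easier to follow, here is the same pipeline split into its intermediate steps. The variable names `s`, `joined` and `dummies` are only illustrative:

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'], 'Key2': ['d', 'f'], 'Key3': ['a', 'c', 'm', 'n']}

# Step 1: a Series indexed by the dict keys, holding one list per key.
s = pd.Series(Dictionary_)

# Step 2: join each list into one '|'-separated string, e.g. 'd|f' for Key2.
joined = s.str.join('|')

# Step 3: split on '|' (the default separator) and build one 0/1 column per value.
dummies = joined.str.get_dummies()
```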

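An alternative that starts from the input DataFrame: use pd.get_dummies, but the duplicated columns then have to be aggregated with max:

df1 = pd.DataFrame.from_dict(Dictionary_, orient='index')

# max(..., level=0) collapses columns that share a name; the level argument was removed in pandas 2.0
df = pd.get_dummies(df1, prefix='', prefix_sep='').max(axis=1, level=0)
print(df)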
      a  d  b  c  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  1  0  0  1  0  0
Key3  1  0  0  1  0  1  1
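On pandas 2.0 and later, `max` no longer accepts `level`. A possible equivalent, offered as a sketch rather than as part of the original answer, is to transpose, group the duplicated labels, and transpose back. `dtype=int` is passed because newer versions of get_dummies return booleans by default:

```python
import pandas as pd

Dictionary_ = {'Key1': ['a', 'b', 'c', 'd'], 'Key2': ['d', 'f'], 'Key3': ['a', 'c', 'm', 'n']}

df1 = pd.DataFrame.from_dict(Dictionary_, orient='index')

# One 0/1 column per (position, value) pair; empty prefixes leave duplicate names like 'c', 'c'.
dummies = pd.get_dummies(df1, prefix='', prefix_sep='', dtype=int)

# Collapse duplicate column names: transpose so they become index labels,
# take the max within each label, then transpose back.
df = dummies.T.groupby(level=0, sort=False).max().T
```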

Answer 2

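Use get_dummies on the df from the question (the rename strips the two-character prefixes such as 0_ that get_dummies adds; max(..., level=0) requires a pandas version older than 2.0):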

>>> pd.get_dummies(df).rename(columns=lambda x: x[2:]).max(axis=1, level=0)
      a  d  b  c  f  m  n
Key1  1  1  1  1  0  0  0
Key2  0  1  0  0  1  0  0
Key3  1  0  0  1  0  1  1

Convert a Python dictionary to a DataFrame with the dict values (lists) as columns, and 1 or 0 depending on whether that column is in the dict's list
