Binary vectorization of a pandas DataFrame column

Question

In a fictional patient dataset, you might come across the following table:

import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"]
})

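which renders as the following dataset:

  Patients             Disease
0     Luke             Cooties
1    Nigel          Dragon Pox
2    Sarah  Greycale & Cooties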

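Now, assume that rows with multiple diseases all use the same pattern (names separated by a character, in this case &) and that a complete list of diseases, diseases, exists. I have not found a simple way to apply pandas.get_dummies one-hot encoding to these cases to get a binary vector for each patient.

What is the simplest way to get the following binary vectorization from the initial DataFrame?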

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties": [1, 0, 1],
    "Dragon Pox": [0, 1, 0],
    "Greycale": [0, 0, 1]
})

Answer 1

You can use Series.str.get_dummies with the correct separator:

df.set_index('Patients')['Disease'].str.get_dummies(' & ').reset_index()

    Patients    Cooties Dragon Pox  Greycale
0   Luke        1       0           0
1   Nigel       0       1           0
2   Sarah       1       0           1
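
For reference, a self-contained sketch of the same call. The reindex step is an addition that is not part of the answer: it assumes the complete list diseases mentioned in the question, and the entry "Spattergroit" is a hypothetical disease that no patient has, included only to show how an all-zero column could be added.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Assumed complete disease list; "Spattergroit" is a hypothetical entry
# that no patient has, used only to illustrate the reindex step.
diseases = ["Cooties", "Dragon Pox", "Greycale", "Spattergroit"]

encoded = (
    df.set_index("Patients")["Disease"]
      .str.get_dummies(" & ")                   # one 0/1 column per name found
      .reindex(columns=diseases, fill_value=0)  # add columns for unseen diseases
      .reset_index()
)
print(encoded)
```

Without the reindex, only diseases that actually occur in the column get a column of their own.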

Answer 2

We can unnest your strings into rows using the explode_str function from the linked answer:

df = explode_str(df, 'Disease', ' & ')

print(df)
  Patients     Disease
0     Luke     Cooties
1    Nigel  Dragon Pox
2    Sarah    Greycale
2    Sarah     Cooties

After that, we use pivot_table with aggfunc=len:

df.pivot_table(index='Patients', columns='Disease', aggfunc=len)\
  .fillna(0).reset_index()

Disease Patients  Cooties  Dragon Pox  Greycale
0           Luke      1.0         0.0       0.0
1          Nigel      0.0         1.0       0.0
2          Sarah      1.0         0.0       1.0

explode_str is the function used in the linked answer; its definition is not reproduced here.
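
The body of explode_str does not appear in this thread, so the following is only a hypothetical stand-in, not the linked answer's implementation. It assumes a pandas version that provides DataFrame.explode with ignore_index, and it swaps pivot_table for pd.crosstab, which returns integer counts directly.

```python
import pandas as pd

def explode_str(df, col, sep):
    """Hypothetical stand-in for the linked helper: one row per separated value."""
    # Turn each cell into a list of names, then give every name its own row.
    return df.assign(**{col: df[col].str.split(sep)}).explode(col, ignore_index=True)

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

long_df = explode_str(df, "Disease", " & ")

# crosstab counts (patient, disease) pairs, so no fillna step is needed.
encoded = pd.crosstab(long_df["Patients"], long_df["Disease"]).reset_index()
encoded.columns.name = None  # drop the leftover "Disease" axis label
print(encoded)
```

Unlike the output shown in the answer, this stand-in renumbers the rows of the long table instead of repeating the original index.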

Answer 3

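Option 1

You can loop over the diseases and check whether each one occurs in df['Disease']:

>>> diseases = ['Cooties', 'Dragon Pox', 'Greycale']
>>> for disease in diseases:
...     # Substring test, so combined entries such as 'Greycale & Cooties' also match
...     df[disease] = df['Disease'].str.contains(disease, regex=False).astype(int)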
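
One way to make this loop robust to combined entries such as 'Greycale & Cooties' is to split each cell first and test membership in the resulting list. A minimal sketch, assuming the ' & ' separator from the question; the variable name names_per_patient is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})
diseases = ["Cooties", "Dragon Pox", "Greycale"]

# Split each cell once into a list of disease names.
names_per_patient = df["Disease"].str.split(" & ")

for disease in diseases:
    # Exact membership test on the split names, cast to 0/1.
    df[disease] = names_per_patient.apply(lambda names: disease in names).astype(int)

print(df)
```

Matching on whole names avoids false positives when one disease name happens to be contained in another.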

Option 2

Alternatively, you can use .get_dummies after splitting the strings in df['Disease'] on ' & '.

>>> sub_df = df['Disease'].str.split(' & ', expand=True)
>>> dummies = pd.get_dummies(sub_df)
>>> dummies

#    0_Cooties  0_Dragon Pox  0_Greycale   1_Cooties
# 0          1             0            0          0
# 1          0             1            0          0
# 2          0             0            1          1

# Let's rename the columns by taking only the text after the '_'
>>> _, dummies.columns = zip(*dummies.columns.str.split('_'))
>>> dummies.groupby(dummies.columns, axis=1).sum()

#      Cooties  Dragon Pox   Greycale 
#   0        1           0          0
#   1        0           1          0
#   2        1           0          1
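
The same idea as a self-contained sketch for recent pandas releases, where groupby(..., axis=1) is deprecated and get_dummies returns booleans by default. Passing prefix="" and prefix_sep="" is a substitute for the rename step above, and the transpose-based grouping is a substitute for the axis=1 call; both are choices made here, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Split on the full separator so no name keeps a stray space.
sub_df = df["Disease"].str.split(" & ", expand=True)

# Empty prefix and separator keep the bare disease name as the column label;
# dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(sub_df, prefix="", prefix_sep="", dtype=int)

# Merge duplicate column labels by grouping the transposed frame on its index.
result = dummies.T.groupby(level=0).sum().T
print(result)
```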
