Creating dummy variables from a string column

Time: 2018-08-17 18:32:15

Tags: python pandas data-structures data-science dummy-variable

I have a pandas DataFrame (N = 1485) that looks like this:

ID          Intervention
1           Blood Draw, Flushed, Locked
1           Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed
1           Blood Draw, Flushed
2           Blood return Verified, Flushed
2           Cap Changed
3           Port De-Accessed

I would like to dummy out the codes in each string, splitting at each comma, so that the result looks something like this:

ID          Blood Draw          Flushed          Locked      ....
1              Yes                Yes             Yes
1              Yes                No              No
...

Thanks!

3 answers:

Answer 0 (score: 0)

You can try the following:

for event in ['Blood Draw', 'Flushed', 'Locked']:
    df[event] = df['Intervention'].str.contains(event)

This gives you True/False rather than 'Yes'/'No', which is likely to be more useful for later processing.
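One caveat with substring matching: 'Locked' is also contained in 'Heparin-Locked', so str.contains would flag rows that only have the latter. A minimal sketch of exact-token matching follows, assuming the codes are always separated by ', '. The three-row frame and the names tokens and substring_hit are illustrative, not from the question.

```python
import pandas as pd

# Small illustrative frame (a subset of the rows in the question)
df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Intervention': ['Blood Draw, Flushed, Locked',
                     'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                     'Cap Changed'],
})

# Substring matching: 'Locked' also matches 'Heparin-Locked' in the second row
substring_hit = df['Intervention'].str.contains('Locked', regex=False)

# Exact-token matching: split on ', ' and test membership in the resulting list
tokens = df['Intervention'].str.split(', ')
for event in ['Blood Draw', 'Flushed', 'Locked']:
    df[event] = tokens.apply(lambda items, e=event: e in items)
```

With this approach the second row is flagged for 'Blood Draw' but not for 'Locked', which matches the layout requested in the question.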

Answer 1 (score: 0)

import pandas as pd

# Split each string on ', ' into one column per position
df1 = df['Intervention'].str.split(', ', expand=True)
# Replace missing values (None/NaN) with empty strings
df2 = df1.fillna('')
# Create dummies for all the columns, keyed by position
dummies = pd.concat([pd.get_dummies(df2[col]) for col in df2], axis=1, keys=df2.columns)

To perform the above steps, select the Intervention column, run this process on it, and merge the result with the original DataFrame, so that the get_dummies statement works (it creates dummies for all the columns).
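The per-position dummies above produce one group of columns for each position in the string, so the same intervention can appear in several groups. As a hedged sketch of one way to collapse them and merge back, the block below swaps in stack() plus a groupby on the row index instead of the per-column concat. The small frame and the names parts, stacked and result are illustrative assumptions.

```python
import pandas as pd

# Small illustrative frame, not the asker's data
df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Intervention': ['Blood Draw, Flushed', 'Flushed', 'Cap Changed, Blood Draw'],
})

# One column per position in the comma-separated string
parts = df['Intervention'].str.split(', ', expand=True)

# Long format: one entry per (original row, position); drop the padding values
stacked = parts.stack().dropna()

# Dummy-encode each entry, then collapse back to one row per original index
dummies = pd.get_dummies(stacked).groupby(level=0).max().astype(bool)

# Merge with the original frame on the shared index
result = df.join(dummies)
```

Because the join is on the index, this assumes the original frame's index is unique.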

Answer 2 (score: 0)

You can use pd.Series.str.get_dummies together with a dictionary mapping:

d = {1: 'yes', 0: 'no'}
res = df.join(df.pop('Intervention').str.get_dummies(', ').applymap(d.get))

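In my opinion, it is best to convert to strings for display purposes only. Booleans are stored and manipulated more efficiently as Boolean series.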
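To illustrate that point, here is a minimal sketch that keeps the indicators as Booleans for analysis and builds the 'yes'/'no' strings only for display. The three-row frame and the names flags, ever and shown are illustrative assumptions, and np.where is just one of several ways to do the conversion.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, not the asker's data
df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Intervention': ['Blood Draw, Flushed', 'Flushed', 'Cap Changed'],
})

# Keep the indicators as Booleans for analysis
flags = df['Intervention'].str.get_dummies(', ').astype(bool)

# Boolean columns aggregate directly, e.g. "did this ID ever get the intervention?"
ever = flags.groupby(df['ID']).any()

# Convert to 'yes'/'no' strings only when displaying
shown = pd.DataFrame(np.where(flags, 'yes', 'no'),
                     index=flags.index, columns=flags.columns)
```

The Boolean frame can be filtered, summed, or grouped without any parsing, while the string frame exists only for presentation.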

Result

print(res)

   ID Blood Draw Blood return Verified Cap Changed Flushed Heparin-Locked  \
0   1        yes                    no          no     yes             no   
1   1        yes                    no          no      no            yes   
2   1        yes                    no          no     yes             no   
3   2         no                   yes          no     yes             no   
4   2         no                    no         yes      no             no   
5   3         no                    no          no      no             no   

  Locked Port De-Accessed Tubing Changed  
0    yes               no             no  
1     no              yes            yes  
2     no               no             no  
3     no               no             no  
4     no               no             no  
5     no              yes             no  

Setup

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3],
                   'Intervention': ['Blood Draw, Flushed, Locked',
                                    'Blood Draw, Port De-Accessed, Heparin-Locked, Tubing Changed',
                                    'Blood Draw, Flushed', 'Blood return Verified, Flushed',
                                    'Cap Changed', 'Port De-Accessed']})