How do I create new columns based on the presence of phrases?

Time: 2018-02-19 04:54:38

Tags: python pandas text extract feature-detection

I want to create new columns based on whether certain phrases are present.

Here is my data

No   Body
1    Office software is already paid
2    Excel software is not paid yet
3    Power point software is already paid

I want to classify the rows by the presence of certain phrases. This is my code:

countries1 = df.Body.str.extract('(software|is already paid)', expand=False)
dummies1 = pd.get_dummies(countries1)
df_1 = pd.concat([df, dummies1], axis=1)

The result is

No   Body                                   software   is already paid    
1    Office software is already paid        0          1
2    Excel software is not paid yet         1          0
3    Power point software is already paid   0          1

What I expect

No   Body                                   software   is already paid    
1    Office software is already paid        1          1
2    Excel software is not paid yet         1          0
3    Power point software is already paid   1          1

What is wrong with my code? Or maybe I am not using the right function.

2 Answers:

Answer 0: (score: 3)

Let's try using extractall

df.assign(**df.Body.str.extractall('(software|is already paid)')[0]
              .str.get_dummies().groupby(level=0).sum())

Output:

   No                                  Body  is already paid  software
0   1       Office software is already paid                1         1
1   2        Excel software is not paid yet                0         1
2   3  Power point software is already paid                1         1
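
For context on why the question's code misses matches: `str.extract` keeps only the first match in each row, while `str.extractall` keeps every match. The following is a minimal sketch on a reconstruction of the sample data. The DataFrame construction and the names `pattern`, `first_only`, `all_matches`, and `flags` are assumptions for illustration, and the `reindex` step is an addition that gives rows without any match a row of zeros.

```python
import pandas as pd

# Reconstruction of the sample data from the question (assumed)
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})

pattern = "(software|is already paid)"

# extract returns only the first match found in each row
first_only = df.Body.str.extract(pattern, expand=False)

# extractall returns every match, indexed by (original row, match number)
all_matches = df.Body.str.extractall(pattern)[0]

flags = (
    all_matches.str.get_dummies()      # one indicator column per phrase
    .groupby(level=0).sum()            # collapse back to one row per original row
    .reindex(df.index, fill_value=0)   # rows with no match at all get zeros
)

result = df.join(flags)
print(result)
```

Note that `sum` counts occurrences, so a phrase appearing twice in one row would yield 2; replacing `.sum()` with `.max()` would keep strict 0/1 flags.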

Answer 1: (score: 2)

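You can use NumPy's np.char.find (the public name for np.core.defchararray.find) to look for the phrases

import numpy as np

phrases = np.array(['software', 'is already paid'])

# Broadcast the row texts against each phrase; find returns -1 when a phrase is absent
dummies = (np.char.find(
    df.Body.values.astype(str),
    phrases[:, None]) > -1
).astype(np.uint)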

dummies

array([[1, 1, 1],
       [1, 0, 1]], dtype=uint64)
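
The shape of this result follows from broadcasting: the row texts have shape (3,) and `phrases[:, None]` has shape (2, 1), so the search is evaluated for every phrase/row pair and yields a (2, 3) array with one row per phrase. A minimal self-contained sketch, assuming a hand-built `texts` array as a stand-in for `df.Body.values.astype(str)` and the hypothetical name `positions` for the intermediate result:

```python
import numpy as np

# Stand-in for df.Body.values.astype(str) (assumed sample data)
texts = np.array([
    "Office software is already paid",
    "Excel software is not paid yet",
    "Power point software is already paid",
])
phrases = np.array(["software", "is already paid"])

# find returns the index of the first occurrence, or -1 if the phrase is absent
positions = np.char.find(texts, phrases[:, None])

# Convert "found / not found" into 1 / 0 flags
dummies = (positions > -1).astype(np.uint)

print(positions.shape)  # (2, 3): one row per phrase, one column per text
print(dummies)
```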

Then you can put the values into the existing dataframe

df['software'], df['is already paid'] = dummies

Or use assign to create a new copy that contains the desired columns

df.assign(**dict(zip(phrases, dummies)))

   No                                  Body  software  is already paid
0   1       Office software is already paid         1                1
1   2        Excel software is not paid yet         1                0
2   3  Power point software is already paid         1                1
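
As a different technique from the two answers above (not taken from either), the same flags can probably be built with plain pandas by testing each phrase with `Series.str.contains`. This is only a minimal sketch: the DataFrame construction and the `phrases` list are assumed for illustration, and `regex=False` is used so that each phrase is treated as literal text rather than as a pattern.

```python
import pandas as pd

# Reconstruction of the sample data from the question (assumed)
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})

phrases = ["software", "is already paid"]

# One 0/1 column per phrase: 1 if the phrase occurs anywhere in Body
out = df.assign(**{
    phrase: df["Body"].str.contains(phrase, regex=False).astype(int)
    for phrase in phrases
})

print(out)
```

This trades the single vectorized call for one pass per phrase, which is usually fine for a short phrase list.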