I want to create new columns based on whether a phrase is present
Here is my data:
No Body
1 Office software is already paid
2 Excel software is not paid yet
3 Power point software is already paid
I want to classify the rows by whether certain phrases are present. This is my code:
countries1 = df.Body.str.extract('(software|is already paid)', expand = False)
dummies1 = pd.get_dummies(countries1)
df_1 = pd.concat([df,dummies1],axis = 1)
The result is:
No Body software is already paid
1 Office software is already paid 0 1
2 Excel software is not paid yet 1 0
3 Power point software is already paid 0 1
What I expect:
No Body software is already paid
1 Office software is already paid 1 1
2 Excel software is not paid yet 1 0
3 Power point software is already paid 1 1
What is wrong with my code? Or am I perhaps not using the right function?
Answer 0 (score: 3)
Let's try using extractall:
df.assign(**df.Body.str.extractall('(software|is already paid)')[0]
            .str.get_dummies().groupby(level=0).sum())
Output:
No Body is already paid software
0 1 Office software is already paid 1 1
1 2 Excel software is not paid yet 0 1
2 3 Power point software is already paid 1 1
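As a rough, self-contained sketch of why this helps: str.extract keeps only the first match in each row, so a row can never populate more than one dummy column, while str.extractall keeps every match. The DataFrame below is rebuilt from the sample data in the question, and the names pattern, first_only, all_matches and result are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "No": [1, 2, 3],
    "Body": [
        "Office software is already paid",
        "Excel software is not paid yet",
        "Power point software is already paid",
    ],
})

pattern = "(software|is already paid)"

# str.extract keeps only the first match found in each row.
first_only = df.Body.str.extract(pattern, expand=False)

# str.extractall keeps every match. Its result is indexed by
# (original row, match number), so the dummies are summed per original row.
all_matches = df.Body.str.extractall(pattern)[0]
dummies = all_matches.str.get_dummies().groupby(level=0).sum()

# Attach the per-row dummy columns to the original frame.
result = df.join(dummies)
print(first_only.tolist())
print(result)
```

Note that a row with no match at all would be missing from dummies and would show NaN after the join, so a fillna(0) may be needed on other data.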
Answer 1 (score: 2)
You can use NumPy's np.char.find (the public name for np.core.defchararray.find) to look for the phrases:
import numpy as np

phrases = np.array(['software', 'is already paid'])
# Row i of dummies flags which Body values contain phrases[i].
dummies = (np.char.find(
    df.Body.values.astype(str),
    phrases[:, None]) > -1
).astype(np.uint)
dummies
array([[1, 1, 1],
       [1, 0, 1]], dtype=uint64)
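The shape of that result comes from broadcasting, which a minimal standalone sketch may make clearer. It assumes the Body values from the question copied into a plain array named bodies (a name chosen here for illustration) and uses np.uint8 instead of np.uint purely to keep the output compact.

```python
import numpy as np

# Body values copied from the question's sample data.
bodies = np.array([
    "Office software is already paid",
    "Excel software is not paid yet",
    "Power point software is already paid",
])
phrases = np.array(["software", "is already paid"])

# phrases[:, None] has shape (2, 1) and bodies has shape (3,), so the two
# broadcast to a (2, 3) grid: one row per phrase, one column per Body value.
positions = np.char.find(bodies, phrases[:, None])

# find returns the index of the first occurrence, or -1 when the phrase is absent.
dummies = (positions > -1).astype(np.uint8)
print(positions.shape)
print(dummies)
```

Because each row of dummies corresponds to one phrase, unpacking it row by row lines up with one new column per phrase.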
You can then put the values into the existing DataFrame:
df['software'], df['is already paid'] = dummies
Or use assign to create a new copy that includes the desired columns:
df.assign(**dict(zip(phrases, dummies)))
No Body software is already paid
0 1 Office software is already paid 1 1
1 2 Excel software is not paid yet 1 0
2 3 Power point software is already paid 1 1