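I have a pandas DataFrame with an ID column and a column of text strings. I am trying to classify the records using str.contains, and I need the words that str.contains identified in each text string to appear in separate columns. I am using Python 3 and pandas. My df is as follows: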
ID Text
1 The cricket world cup 2019 has begun
2 I am eagrly waiting for the cricket worldcup 2019
3 I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019
4 I love cricket to watch and badminton to play
searchfor = ['cricket', 'world cup', '2019']
df['Text'].str.contains('|'.join(searchfor))
Desired output:
ID  Text                                                                                       phrase1  phrase2    phrase3
1   The cricket world cup 2019 has begun                                                       cricket  world cup  2019
2   I am eagrly waiting for the cricket worldcup 2019                                          cricket  world cup  2019
3   I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019  cricket  world cup  2019
4   I love cricket to watch and badminton to play                                              cricket
Answer 0 (score: 1)
You can use np.where:
import numpy as np

search_for = ['cricket', 'world cup', '2019']
# Add one column per phrase: the phrase where the text contains it, NaN otherwise
for word in search_for:
    df[word] = np.where(df.text.str.contains(word), word, np.nan)
df
text cricket world cup 2019
1 The cricket world cup 2019 has begun cricket world cup 2019
2 I am eagrly waiting for the cricket worldcup 2019 cricket nan 2019
3 I will try to watch all the mathes my favourit... cricket nan 2019
4 I love cricket to watch and badminton to play cricket nan nan
Syntax of np.where: np.where(condition, x, y). It returns x where the condition is True and y otherwise.
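In the output above, unmatched cells show the literal text nan rather than an empty value. If real missing values are preferred, one possible variant is sketched below. It swaps np.where for Series.where, which is not what this answer uses; the small frame and the column name text are assumptions made only so the sketch runs on its own.

```python
import pandas as pd

# Assumed sample data mirroring the question; the column name "text" is an assumption.
df = pd.DataFrame({
    "text": [
        "The cricket world cup 2019 has begun",
        "I am eagrly waiting for the cricket worldcup 2019",
        "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
        "I love cricket to watch and badminton to play",
    ]
})
search_for = ["cricket", "world cup", "2019"]

for word in search_for:
    # Boolean mask: True where the text contains the phrase as a literal substring.
    mask = df["text"].str.contains(word, regex=False)
    # Series.where keeps values where the mask is True and inserts a missing value elsewhere.
    df[word] = pd.Series(word, index=df.index).where(mask)
```

With this variant, df["world cup"].isna() flags the rows without a match, so the new columns can be filtered or counted with the usual missing-value tools.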
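Answer 1 (score: 1)
The trick is to use str.findall instead of str.contains to get a list of all matching phrases for each row. After that, it is only a matter of reshaping the dataframe into the format you want.
Here is your starting point: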
import pandas as pd

df = pd.DataFrame(
    [
        'The cricket world cup 2019 has begun',
        'I am eagrly waiting for the cricket worldcup 2019',
        'I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019',
        'I love cricket to watch and badminton to play',
    ],
    index=pd.Index(range(1, 5), name="ID"),
    columns=["Text"]
)
searchfor = ['cricket','world cup','2019']
Here is a sample solution:
pattern = "(" + "|".join(searchfor) + ")"
matches = (
    df.Text.str.findall(pattern)
    .apply(pd.Series)
    .stack()
    .reset_index(-1, drop=True)
    .to_frame("phrase")
    .assign(match=True)
)
# phrase match
# ID
# 1 cricket True
# 1 world cup True
# 1 2019 True
# 2 cricket True
# 2 2019 True
# 3 cricket True
# 3 2019 True
# 4 cricket True
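For comparison, the same long table can probably be built with Series.explode in place of the apply(pd.Series) / stack / reset_index steps. This is a sketch under assumptions, not the answer's own method: it assumes a pandas version that provides Series.explode (0.25 or later, as far as I know), and it escapes the phrases with re.escape, which the original pattern does not do.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [
        "The cricket world cup 2019 has begun",
        "I am eagrly waiting for the cricket worldcup 2019",
        "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
        "I love cricket to watch and badminton to play",
    ],
    index=pd.Index(range(1, 5), name="ID"),
    columns=["Text"],
)
searchfor = ["cricket", "world cup", "2019"]

# Escape each phrase so regex metacharacters in a phrase are matched literally.
pattern = "|".join(re.escape(phrase) for phrase in searchfor)

matches = (
    df["Text"].str.findall(pattern)  # one list of matched phrases per row
    .explode()                       # one row per matched phrase, ID index repeated
    .dropna()                        # rows with no match explode to a missing value
    .to_frame("phrase")
    .assign(match=True)
)
```

The resulting frame has the same phrase and match columns, indexed by ID, so the reshaping step that follows applies to it unchanged.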
You can also reshape the dataframe so that each phrase gets its own column:
matches.pivot(columns="phrase", values="match").fillna(False)
# phrase 2019 cricket world cup
# ID
# 1 True True True
# 2 True True False
# 3 True True False
# 4 False True False
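If the layout from the question (the original text plus phrase1, phrase2, phrase3 columns) is what is needed, the lists returned by str.findall can be spread into columns directly. The sketch below is one possible way to do that and rests on assumptions: the column names phrase1, phrase2, ... are taken from the question, and phrases are placed in the order they are found, so a row that lacks "world cup" gets "2019" in phrase2 rather than an empty cell there.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [
        "The cricket world cup 2019 has begun",
        "I am eagrly waiting for the cricket worldcup 2019",
        "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
        "I love cricket to watch and badminton to play",
    ],
    index=pd.Index(range(1, 5), name="ID"),
    columns=["Text"],
)
searchfor = ["cricket", "world cup", "2019"]

pattern = "|".join(re.escape(phrase) for phrase in searchfor)
found = df["Text"].str.findall(pattern)

# Build one column per match position; shorter lists are padded with missing values.
phrases = pd.DataFrame(found.tolist(), index=df.index)
phrases.columns = [f"phrase{i + 1}" for i in range(phrases.shape[1])]

result = df.join(phrases)
```

Note that "worldcup" written without a space is not matched by the phrase "world cup" in either approach, which is why rows 2 and 3 show no match for it.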