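How to get the specific words matched by str.contains

Time: 2019-06-01 19:08:50

Tags: python-3.x pandas

I have a pandas DataFrame with an ID and a text string. I am trying to classify the records using str.contains. I need the words that str.contains identified in the text string, placed in separate columns. I am using Python 3 and pandas. My df is as follows: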

ID  Text
1   The cricket world cup 2019 has begun
2   I am eagrly waiting for the cricket worldcup 2019 
3   I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019
4   I love cricket to watch and badminton to play


searchfor = ['cricket','world cup','2019']
df['Text'].str.contains('|'.join(searchfor))

Desired output:

ID  Text                                                                                       phrase1  phrase2    phrase3
1   The cricket world cup 2019 has begun                                                       cricket  world cup  2019
2   I am eagrly waiting for the cricket worldcup 2019                                          cricket  world cup  2019
3   I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019  cricket  world cup  2019
4   I love cricket to watch and badminton to play                                              cricket

2 Answers:

Answer 0 (score: 1)

You can use np.where:

import numpy as np

search_for = ['cricket', 'world cup', '2019']

for word in search_for:
    df[word] = np.where(df.text.str.contains(word), word, np.nan)

df

   text                                               cricket  world cup  2019
1  The cricket world cup 2019 has begun               cricket  world cup  2019
2  I am eagrly waiting for the cricket worldcup 2019  cricket  nan        2019
3  I will try to watch all the mathes my favourit...  cricket  nan        2019
4  I love cricket to watch and badminton to play      cricket  nan        nan

Syntax of np.where: np.where(condition, x, y). It returns x where the condition is True and y otherwise.
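For anyone who wants to run the loop end to end, here is a self-contained sketch. It makes two choices that are not in the answer: `regex=False`, so each phrase is treated as a literal substring, and `None` in place of `np.nan`, so that unmatched cells are real missing values that `isna()` detects. The three-row frame is a shortened copy of the question's data.

```python
import numpy as np
import pandas as pd

# Shortened copy of the question's data (rows 1, 2 and 4), column named "text"
df = pd.DataFrame(
    {
        "text": [
            "The cricket world cup 2019 has begun",
            "I am eagrly waiting for the cricket worldcup 2019",
            "I love cricket to watch and badminton to play",
        ]
    },
    index=pd.Index([1, 2, 4], name="ID"),
)
search_for = ["cricket", "world cup", "2019"]

for word in search_for:
    # regex=False: match the phrase as a plain substring, not as a pattern
    found = df["text"].str.contains(word, regex=False)
    # Assumption of this sketch: None instead of np.nan, so missing cells
    # are detected by isna() rather than stored as text
    df[word] = np.where(found, word, None)

print(df[search_for])
```

Row 2 contains "worldcup" without a space, so its "world cup" column stays empty, just as in the answer's output.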

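Answer 1 (score: 1)

The trick is to use str.findall instead of str.contains to get a list of all matched phrases. After that, it is just a matter of reshaping the DataFrame into the format you want.

Here is your starting point: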

import pandas as pd

df = pd.DataFrame(
    [
        'The cricket world cup 2019 has begun',
        'I am eagrly waiting for the cricket worldcup 2019',
        'I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019',
        'I love cricket to watch and badminton to play',
    ],
    index=pd.Index(range(1, 5), name="ID"),
    columns=["Text"]
)
searchfor = ['cricket','world cup','2019']

Here is an example solution:

pattern = "(" + "|".join(searchfor) + ")"
matches = (
    df.Text.str.findall(pattern)
    .apply(pd.Series)
    .stack()
    .reset_index(-1, drop=True)
    .to_frame("phrase")
    .assign(match=True)
)

#        phrase  match
# ID                  
# 1     cricket   True
# 1   world cup   True
# 1        2019   True
# 2     cricket   True
# 2        2019   True
# 3     cricket   True
# 3        2019   True
# 4     cricket   True
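The same long-format table can probably be built with `Series.explode` (available since pandas 0.25) instead of `.apply(pd.Series).stack()`. The sketch below is an alternative, not the answer's code. It also escapes each phrase with `re.escape`, which only matters if a phrase contains regex metacharacters; that is an added precaution.

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Text": [
            "The cricket world cup 2019 has begun",
            "I am eagrly waiting for the cricket worldcup 2019",
            "I will try to watch all the mathes my favourite teams playing in the cricketworldcup 2019",
            "I love cricket to watch and badminton to play",
        ]
    },
    index=pd.Index(range(1, 5), name="ID"),
)
searchfor = ["cricket", "world cup", "2019"]

# Escape each phrase so it is matched literally inside the alternation
pattern = "|".join(re.escape(p) for p in searchfor)

matches = (
    df["Text"].str.findall(pattern)  # one list of matched phrases per row
    .explode()                       # one row per matched phrase, ID repeated
    .dropna()                        # rows with no match explode to NaN
    .to_frame("phrase")
    .assign(match=True)
)
print(matches)
```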

You can also reshape the DataFrame so that each phrase has its own column:

matches.pivot(columns="phrase", values="match").fillna(False)

# phrase   2019  cricket  world cup
# ID                               
# 1        True     True       True
# 2        True     True      False
# 3        True     True      False
# 4       False     True      False