Question

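I'm fairly new to Python DataFrames, so this may sound simple. I have a column named "body_text" in a DataFrame, and I want to check whether each row of body_text contains the word "Hello". Based on that, I want to create another column with 1 or 0 as its value.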

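I tried str.contains("Hello"), but got an error because it selected only the rows containing "Hello" and tried to put those into another column. I also tried other solutions (loops and `str in str`), which only led to more errors.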

textdf = traindf[['request_title','request_text_edit_aware']]

traindf is a huge DataFrame, and I extracted only 2 columns from it.

Answer 1

If your match is case-sensitive, use Series.str.contains and chain .astype onto it to cast the result to int:

df['contains_hello'] = df['body_text'].str.contains('Hello').astype(int)

If the match should be case-insensitive, add the case=False argument:

df['contains_hello'] = df['body_text'].str.contains('Hello', case=False).astype(int)
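
One caveat with the two one-liners above: if body_text has missing values, str.contains may return NaN for those rows, and the cast to int can then fail. A minimal sketch, assuming you want missing text to count as "no match", passes na=False. The sample data here is made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({'body_text': ['Hello world', None, 'goodbye']})

# na=False treats missing text as "does not contain", so the boolean
# result has no NaN and can be cast to int safely.
df['contains_hello'] = (
    df['body_text'].str.contains('Hello', case=False, na=False).astype(int)
)

print(df['contains_hello'].tolist())
```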

Update

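If you need to match multiple patterns, use a regex with the | ('OR') character. Depending on your requirements, you may also need the "word boundary" character, \b.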

If you want to learn more about regex patterns and character classes, Regexr is a good resource.

Example

import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here', 'Hello, this should match', 'high low - dont match', 'oh hi there - match me']})

#                      body_text
#    0           no matches here   
#    1  Hello, this should match   <--  we want to match this 'Hello'
#    2     high low - dont match   <-- 'hi' exists in 'high', but we don't want to match it
#    3    oh hi there - match me   <--  we want to match 'hi' here

df['contains_hello'] = df['body_text'].str.contains(r'Hello|\bhi\b', regex=True).astype(int)

                  body_text  contains_hello
0           no matches here               0
1  Hello, this should match               1
2     high low - dont match               0
3    oh hi there - match me               1

Sometimes it is useful to keep a list of the words to match, so that the regex pattern can be built more easily with a Python list comprehension. For example:

match = ['hello', 'hi']    
pat = '|'.join([fr'\b{x}\b' for x in match])
# '\bhello\b|\bhi\b'  -  meaning 'hello' OR 'hi'

df.body_text.str.contains(pat)
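
Note that the lowercase 'hello' in this pattern only matches 'Hello' if the search is case-insensitive. A sketch of the same idea, assuming you want case-insensitive matching and a 0/1 column, is below. Using re.escape is an extra precaution added here (not part of the answer above) so that words containing regex metacharacters are matched literally.

```python
import re
import pandas as pd

df = pd.DataFrame({'body_text': ['no matches here',
                                 'Hello, this should match',
                                 'high low - dont match',
                                 'oh hi there - match me']})

words = ['hello', 'hi']

# Wrap each word in \b...\b so only whole words match, then join with | (OR).
pat = '|'.join(rf'\b{re.escape(w)}\b' for w in words)

# case=False lets 'hello' in the pattern match 'Hello' in the text.
df['contains_greeting'] = df['body_text'].str.contains(pat, case=False).astype(int)

print(df['contains_greeting'].tolist())
```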

Answer 2

Using the textdf you defined in your question, try:

# 'Hello' in t is a substring test; t == 'Hello' would only match rows that are exactly 'Hello'
textdf['new_column'] = [1 if 'Hello' in t else 0 for t in textdf['body_text']]
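
The question asks whether each row contains the word, which is different from being equal to it. The sketch below contrasts a substring test, an equality test, and the vectorized str.contains on a small made-up textdf. It assumes there are no missing values in the column.

```python
import pandas as pd

# Hypothetical stand-in for the asker's textdf.
textdf = pd.DataFrame({'body_text': ['Hello there', 'Hello', 'nothing here']})

# Substring test: 1 when 'Hello' appears anywhere in the text.
by_loop = [1 if 'Hello' in t else 0 for t in textdf['body_text']]

# Equality test: 1 only when the whole text is exactly 'Hello'.
exact_only = [1 if t == 'Hello' else 0 for t in textdf['body_text']]

# Vectorized substring test; regex=False treats 'Hello' as a literal string.
by_contains = (
    textdf['body_text'].str.contains('Hello', regex=False).astype(int).tolist()
)

print(by_loop, exact_only, by_contains)
```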

Answer 3

You can use the get_dummies() function in pandas.

Here is a link to the documentation.

Python Pandas: iterate over an entire column and check whether it contains a specific str
