Question

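I have a CSV file that I read into a pandas DataFrame. There are two specific columns, 'Notes' and 'ActivityType', that I want to use as criteria. If the 'Notes' column contains the string 'Morning workout' or 'Morning exercise', and/or the 'ActivityType' column contains any string value (most cells are empty, and I don't want to count null values), I want to make a new column 'MorningExercise' that holds 1 if either condition is met and 0 if neither is.

I have been using the code below to create a new column that holds 1 or 0 depending on whether the text condition is met in the 'Notes' column, but I haven't figured out how to also include the case where the 'ActivityType' column contains any string value.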

JoinedTables['MorningExercise'] = JoinedTables['Notes'].str.contains(('Morning workout' or 'Morning exercise'), case=False, na=False).astype(int)

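For the 'ActivityType' column, I think the pd.notnull() function should serve as the criterion.

I really just need a way in Python to check whether a row meets either criterion and, depending on the result, enter 1 or 0 in the new column.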

Answer 1

You need to build a regular expression pattern to use with str.contains:

regex = r'Morning\s*(?:workout|exercise)'
JoinedTables['MorningExercise'] = \
       JoinedTables['Notes'].str.contains(regex, case=False, na=False).astype(int)

Details

Morning       # match "Morning"
\s*           # 0 or more whitespace chars
(?:           # open non-capturing group
workout       # match "workout" 
|             # OR operator
exercise      # match "exercise"
)             # close non-capturing group

The pattern looks for Morning, followed by either workout or exercise.
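
The regex above only covers the 'Notes' column. To also count rows where 'ActivityType' holds any non-null value, one possible approach is to build a second boolean mask with notna() and combine the two with the element-wise | operator. The following is a minimal sketch under stated assumptions: the sample rows are made up to stand in for the CSV, and the names notes_match and has_activity are hypothetical helpers, not part of the original code. It also assumes that empty CSV cells were read in as NaN (the read_csv default); an empty string would count as non-null here.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
JoinedTables = pd.DataFrame({
    "Notes": ["Morning workout", "Evening walk", None,
              "morning exercise at gym", "Lunch"],
    "ActivityType": [None, "Run", np.nan, None, None],
})

# Side note: in plain Python, ('Morning workout' or 'Morning exercise')
# evaluates to just 'Morning workout', which is why a regex is needed.
regex = r"Morning\s*(?:workout|exercise)"

# True where Notes mentions a morning workout/exercise;
# na=False makes missing notes count as False.
notes_match = JoinedTables["Notes"].str.contains(regex, case=False, na=False)

# True where ActivityType holds any non-null value.
has_activity = JoinedTables["ActivityType"].notna()

# Element-wise OR of the two masks, then convert True/False to 1/0.
JoinedTables["MorningExercise"] = (notes_match | has_activity).astype(int)

print(JoinedTables)
```

Keeping the two masks as separate variables makes each condition easy to inspect on its own before they are combined.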
