Question

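Using Pandas, I am reading a large CSV file and need only the rows that contain certain exact strings. I have this working, but I feel there should be a better way: as I add more search criteria it becomes very slow, and the search pattern is hard to maintain. Here is the code snippet: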

regex_search = r'(?:\,|^)EXACT PATTERN 1(?:\,|$)|(?:\,|^)EXACT PATTERN 2(?:\,|$)'
results = df[df['Column_to_search'].str.contains(regex_search)]
# now write all the rows where the column had a matched value to a new CSV file

The regex I am using basically says:

(?:\,|^)  --> pattern must be preceded by a comma or the start of the string
(?:\,|$)  --> pattern must be followed by a comma or the end of the string
|         --> OR, so that I can match as many search terms as needed
#...
#df is just the data frame that was loaded via pandas

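This list causes a lot of maintenance problems! I have to take the list of terms and run it through a loop to build the regex string, and then format any new phrases as needed.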
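
For illustration, the loop described above could be reduced to a single join over the list of terms. This is only a sketch: the search terms and the small DataFrame are made-up sample data, and `re.escape` is added on the assumption that the terms should be matched literally rather than as regex syntax.

```python
import re

import pandas as pd

# Hypothetical search terms; in practice this list could be loaded from a file.
search_terms = ["EXACT PATTERN 1", "EXACT PATTERN 2", "A.B"]

# Escape each term so regex metacharacters (such as ".") are matched literally,
# then wrap the whole alternation once in the comma/start and comma/end anchors.
alternation = "|".join(re.escape(term) for term in search_terms)
regex_search = rf"(?:,|^)(?:{alternation})(?:,|$)"

# Made-up sample data standing in for the loaded CSV.
df = pd.DataFrame(
    {
        "Column_to_search": [
            "EXACT PATTERN 1",
            "x,EXACT PATTERN 2,y",
            "EXACT PATTERN 10",
            "AxB",
        ]
    }
)

results = df[df["Column_to_search"].str.contains(regex_search, regex=True)]
print(results)
```

Adding a new phrase then only means appending it to `search_terms`. This keeps the pattern maintainable but does not by itself address the speed problem.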

Originally, my search term was:

regex_search = 'EXACT PATTERN 1|EXACT PATTERN 2'

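This was much easier to maintain, but it caused problems: because it is an unanchored regex, it also matches a lot of false positives, especially for shorter phrases or acronyms.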
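
A small sketch of the false-positive problem, using made-up acronyms and sample cells (none of these values come from the real data):

```python
import pandas as pd

# Made-up sample cells: only "ABC" and "XYZ" are exact matches.
df = pd.DataFrame({"Column_to_search": ["ABC", "ABCD", "XABCX", "XYZ"]})

# Unanchored alternation: matches any cell that merely contains a term.
loose = df["Column_to_search"].str.contains("ABC|XYZ")

# Anchored alternation: the term must span a whole comma-delimited field.
strict = df["Column_to_search"].str.contains(r"(?:,|^)(?:ABC|XYZ)(?:,|$)")

print(df.assign(loose=loose, strict=strict))
```

Here the unanchored pattern flags every row, while the anchored one keeps only the two exact matches.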

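The regex_search variable typically holds 300+ phrases to search for, and the CSV file has thousands of rows. Is there a Python function that does this? Maybe something like: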

.str.match_multiple()
#or
regex_search_list = ['ABC','XYZ','ETC']
.str.match_in_list(regex_search_list)

I don't think I can use .match, because my regex string has multiple values. If pandas has a way to match column values against a list, I haven't found it.

Thoughts? Is there a better way?

Answer 1

Thanks to A.Kot's comment, I used .isin(), and my script went from 20 minutes to about 10 seconds.

New code:

list_search = ['EXACT PATTERN 1', 'EXACT PATTERN 2']
results = df[df['Column_to_search'].isin(list_search)]
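
One caveat worth sketching: `.isin()` compares the whole cell value against the list, so it only replaces the anchored regex when each cell holds a single value. The example below uses made-up sample data. Its second half assumes, as the comma anchors in the question suggest, that some cells may hold several comma-separated values; that variant is an illustration, not part of the original answer.

```python
import pandas as pd

# Made-up sample data standing in for the loaded CSV.
df = pd.DataFrame(
    {
        "Column_to_search": [
            "EXACT PATTERN 1",
            "EXACT PATTERN 10",
            "other,EXACT PATTERN 2",
            "nothing",
        ]
    }
)
list_search = ["EXACT PATTERN 1", "EXACT PATTERN 2"]

# Whole-cell match: only cells equal to one of the terms are kept.
whole_cell = df[df["Column_to_search"].isin(list_search)]

# Assumption: some cells hold several comma-separated values.
# Split each cell into fields, put one field per row with explode(),
# test each field with isin(), then collapse back to one flag per original row.
fields = df["Column_to_search"].str.split(",").explode()
mask = fields.isin(list_search).groupby(level=0).any()
any_field = df[mask]

print(whole_cell)
print(any_field)
```

With this sample, the whole-cell version keeps only the first row, while the split-and-explode version also keeps the row where the term is one of several comma-separated fields.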
