Question

I have two DataFrames, as follows:

df1: contains one column ['search_term'] and 100,000 rows

These are the words/phrases I want to search for in my files.

df2: contains parsed file contents in a column called file_text

This DataFrame has 20,000 rows and two columns, ['file_name', 'file_text'].

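What I need is the index of every occurrence of each search term in file_text.

I have not been able to find an efficient way to perform this search.

I am using the str.find() function with groupby, but it takes about 0.25 s per file_text/search-term pair (which becomes extremely long for 20k files * 100k search terms).

Any ideas on how to do this quickly and efficiently would be a lifesaver!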

Answer 1

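I remember having to do something similar in one of our projects. We had a very large set of keywords that we wanted to search for in one large string, finding every occurrence of each keyword. Let's call the string we want to search content. After some benchmarking, the solution I adopted was a two-pass approach: first check whether the keyword exists in content using the highly optimized in operator, then use a regular expression to find all of its occurrences.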

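import re

keywords = [...list of your keywords ...]
found_keywords = []

# Pass 1: keep only the keywords that occur at all (fast substring test).
for keyword in keywords:
    if keyword in content:
        found_keywords.append(keyword)

# Pass 2: locate every occurrence of the keywords that were found.
for keyword in found_keywords:
    # re.escape makes the keyword match literally, even if it contains regex metacharacters.
    for match in re.finditer(re.escape(keyword), content):
        print(match.start())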
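
For the setup in the question, the same two-pass idea might be wrapped in a function along the lines of the sketch below. This is only an illustration under a few assumptions: the function name find_term_positions and its return shape are made up for this example, the search terms are treated as literal strings, and the DataFrame columns are assumed to have been pulled out as plain Python sequences first (for instance df1['search_term'].tolist() and zip(df2['file_name'], df2['file_text'])). It has not been benchmarked against the 20k x 100k workload.

```python
import re
from collections import defaultdict


def find_term_positions(search_terms, files):
    """Return {file_name: {term: [start offsets]}} for the terms that occur.

    search_terms: iterable of literal strings (hypothetical input,
                  e.g. df1['search_term'].tolist()).
    files: iterable of (file_name, file_text) pairs (hypothetical input,
           e.g. zip(df2['file_name'], df2['file_text'])).
    """
    # Compile each escaped term once instead of once per file.
    compiled = [(term, re.compile(re.escape(term))) for term in search_terms]
    results = defaultdict(dict)
    for file_name, text in files:
        for term, pattern in compiled:
            # Pass 1: cheap substring test with the `in` operator.
            if term in text:
                # Pass 2: collect the start of every non-overlapping match.
                results[file_name][term] = [m.start() for m in pattern.finditer(text)]
    return dict(results)


# Tiny made-up example data, just to exercise the function.
files = [("a.txt", "the cat sat on the mat"), ("b.txt", "no match here?")]
terms = ["the", "at", "dog", "?"]
positions = find_term_positions(terms, files)
print(positions)
```

Files with no matching terms do not appear in the result at all, and re.finditer reports non-overlapping matches only, so overlapping occurrences of a term would need a different scan.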
