Question

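I want to search for a list of strings (the list contains 2k to 10k strings) in thousands of text files saved in one folder (there may be as many as 100k text files, each between 1 KB and 100 MB in size), and output a CSV file listing the names of the text files that matched.

I have written code that does the job, but searching for 2000 strings in about 2000 text files (roughly 2.5 GB in total) takes around 8-9 hours.

In addition, this approach uses up the system's memory, so I sometimes have to split the 2000 text files into smaller batches before the code will run.

The code is below (Python 2.7).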

# -*- coding: utf-8 -*-
import pandas as pd
import os

def match(searchterm):
    global result
    filenameText = ''
    matchrateText = ''
    for i, content in enumerate(TextContent):
        matchrate = search(searchterm, content)
        if matchrate:
            filenameText += str(listoftxtfiles[i])+";"
            matchrateText += str(matchrate) + ";"
    result.append([searchterm, filenameText, matchrateText])


def search(searchterm, content):
    if searchterm.lower() in content.lower():
        return 100
    else:
        return 0


listoftxtfiles = os.listdir("Txt/")
TextContent = []
for txt in listoftxtfiles:
    with open("Txt/"+txt, 'r') as txtfile:
        TextContent.append(txtfile.read())

result = []
for i, searchterm in enumerate(searchlist):
    print("Checking for " + str(i + 1) + " of " + str(len(searchlist)))
    match(searchterm)

df=pd.DataFrame(result,columns=["String","Filename", "Hit%"])

Sample input below.

List of strings -

["Blue Chip", "JP Morgan Global Healthcare","Maximum Horizon","1838 Large Cornerstone"]

Text files -

Ordinary text files containing different lines separated by \n

Sample output below.

String,Filename,Hit%
JP Morgan Global Healthcare,000032.txt;000031.txt;000029.txt;000015.txt;,100;100;100;100;
Blue Chip,000116.txt;000126.txt;000114.txt;,100;100;100;
1838 Large Cornerstone,NA,NA
Maximum Horizon,000116.txt;000126.txt;000114.txt;,100;100;100;

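As the example above shows, the first string matched 4 files (separated by ;), the second string matched 3 files, and the third string matched no file.

Is there a faster way to do the search without splitting the text files into batches?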

Answer 1

Your code pushes a lot of data into memory, because you load all the files into memory first and only then search them.
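To illustrate the memory point, here is a minimal sketch that reads a single file line by line instead of loading it whole, so memory use is bounded by the longest line even for a 100 MB file. The function name `terms_found_in_file` is hypothetical, and the sketch assumes that no search term contains a newline (the question says the files consist of lines separated by \n). It is meant to show how to keep memory bounded, not as a guaranteed speed-up: with thousands of terms the inner loop is still expensive.

```python
def terms_found_in_file(path, search_terms):
    """Return the set of search terms that occur in the file at path.

    The file is read line by line, so it is never held in memory as a
    whole. Assumption: no search term contains a newline.
    """
    # Map each lowercased term back to its original spelling.
    remaining = {term.lower(): term for term in search_terms}
    found = set()
    with open(path, 'r') as handle:
        for line in handle:
            lowered_line = line.lower()
            # Iterate over a copy because matched terms are removed.
            for lowered_term in list(remaining):
                if lowered_term in lowered_line:
                    found.add(remaining.pop(lowered_term))
            if not remaining:
                break  # every term was found; skip the rest of the file
    return found
```

Calling this once per file and collecting the returned sets per file name would give the same match information as the original code without ever keeping more than one line in memory.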

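Performance aside, your code could use some cleanup. Try to write functions that are as self-contained as possible and do not depend on global variables (for input or output).

I rewrote your code using a list comprehension, and it became a lot more compact.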

# -*- coding: utf-8 -*-
from os import listdir
from os.path import isfile

def search_strings_in_files(path_str, search_list):
    """ Returns a list of lists, where each inner list contains three fields:
    the filename (without path), a string in search_list and the
    frequency (number of occurrences) of that string in that file"""

    filelist = listdir(path_str)

    return [[filename, s, open(path_str+filename, 'r').read().lower().count(s)]
        for filename in filelist
            if isfile(path_str+filename)
                for s in [sl.lower() for sl in search_list] ]

if __name__ == '__main__':
    print(search_strings_in_files('/some/path/', ['some', 'strings', 'here']))

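Mechanisms I used in this code:

list comprehension to loop over search_list and over the files.
compound statements to loop only over the files in the directory (and not over subdirectories).
method chaining to call a method directly on an object that is returned.

Tip for reading the list comprehension: try reading it from bottom to top, so:

I convert all items in search_list to lowercase using a list comprehension.
Then I loop over that list (for s in...)
Then I use a compound statement to keep only directory entries that are files (if isfile...)
Then I loop over all the files (for filename...)
In the first line, I create the sublist containing three items:
- the filename
- s, that is the lowercase search string
- a method-chained call that opens the file, reads all its contents, converts them to lowercase and counts the number of occurrences of s.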
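For readers who find the comprehension hard to follow, the same steps can be sketched as explicit loops. The name `search_strings_in_files_loops` is hypothetical. One deliberate difference from the comprehension, stated as such: this version reads each file once and reuses its content for every search string, whereas the comprehension re-opens and re-reads the file for each string.

```python
from os import listdir
from os.path import isfile, join


def search_strings_in_files_loops(path_str, search_list):
    """Loop-based sketch of the list comprehension above.

    Returns [filename, lowercase search string, count] lists.
    """
    lowered = [sl.lower() for sl in search_list]  # innermost list, built once
    rows = []
    for filename in listdir(path_str):            # "for filename in filelist"
        full_path = join(path_str, filename)
        if not isfile(full_path):                 # the "if isfile..." filter
            continue
        with open(full_path, 'r') as handle:      # read each file once, then close it
            content = handle.read().lower()
        for s in lowered:                         # "for s in ..."
            rows.append([filename, s, content.count(s)])
    return rows
```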

This code uses only "standard" Python functionality. If you need more performance, you should look into specialized libraries for this task.
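The question asks for a CSV file with one row per search string, while search_strings_in_files returns one row per file and string. A minimal sketch of regrouping those rows into the layout of the sample output follows. The function name `write_matches_csv` is hypothetical, the fixed 100 mirrors the question's Hit% column, and the sketch assumes Python 3 and a text-mode file object.

```python
import csv
from collections import OrderedDict


def write_matches_csv(rows, out_file):
    """Write one CSV line per search string: String, Filename, Hit%.

    rows is a list of [filename, search_string, count] lists.
    out_file is any writable text-mode file object.
    """
    grouped = OrderedDict()  # keeps search strings in first-seen order
    for filename, term, count in rows:
        hits = grouped.setdefault(term, [])
        if count > 0:
            hits.append(filename)
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['String', 'Filename', 'Hit%'])
    for term, files in grouped.items():
        if files:
            # Same ';'-terminated layout as the sample output.
            writer.writerow([term, ';'.join(files) + ';', '100;' * len(files)])
        else:
            writer.writerow([term, 'NA', 'NA'])
```

When writing to a real file with the csv module in Python 3, open it with `open(path, 'w', newline='')`.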
