Question

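Background: Every week I receive a list of lab results as an HTML file. Each week there are about 3,000 results, and each result has two to four tables associated with it. For each result/test, I only care about some standard information stored in one of those tables. That table can be uniquely identified because its first cell (first row, first column) always contains the text "Lab Results".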

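Problem: The code below works well when I run it on one file at a time. That is, instead of looping over a directory, I point get_data = open() at a specific file. However, I want to pull the data from the past few years without running each file individually, so I used the glob module and a for loop to go through all the files in the directory. The issue is that by the time I reach the third file in the directory, I get a MemoryError.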

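Question: Is there a way to clear/reset memory between files? That way I could loop over all the files in the directory instead of pasting in each file name individually. As you can see in the code below, I tried clearing the variables with del, but that did not work.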

Thanks.

from bs4 import BeautifulSoup
import glob
import gc

for FileName in glob.glob("\\Research Results\\*"):

    get_data = open(FileName,'r').read()

    soup = BeautifulSoup(get_data)

    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            with open("Results_File.txt","a") as out_file:
                out_string = ""
                out_string += complete_row
                out_string += "\n"
                out_file.write(out_string)
                out_file.close()

    del get_data
    del soup
    del tables
    gc.collect()

print ("done")

Answer 1

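I am very much a beginner programmer, and I ran into the same problem. I did three things that seem to have solved it:

Also call garbage collection (gc.collect()) at the start of each iteration.
Move the parsing into a function that is called on each iteration, so all the global variables become local variables and are deleted at the end of the function.
Use soup.decompose().

I think the second change is probably what fixed it, but I have not had time to check, and I do not want to change code that works.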
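To see why moving the work into a function can help, here is a minimal sketch that uses only the standard library. `Tree` and `Node` are hypothetical stand-ins for a parsed document whose nodes point back at their parents; they are not BeautifulSoup classes, and the assumption is that a parsed soup holds similar reference cycles. The sketch checks that the object is gone once the function has returned and gc.collect() has run.

```python
import gc
import weakref


class Node:
    """Hypothetical child node that keeps a reference back to its parent."""

    def __init__(self, parent):
        self.parent = parent


class Tree:
    """Hypothetical stand-in for a parsed document (not a BeautifulSoup class)."""

    def __init__(self):
        # Parent and child refer to each other, which forms a reference cycle.
        self.child = Node(self)


def parse_once():
    tree = Tree()             # local variable, like soup inside a parsing function
    return weakref.ref(tree)  # a weak reference does not keep the tree alive


tree_ref = parse_once()
gc.collect()                  # the cycle collector reclaims the unreachable tree
print(tree_ref() is None)     # True once the tree has been freed
```

A name that stays bound at module level keeps its object alive until it is rebound or deleted, whereas a local name disappears as soon as the function returns.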

For this code, the solution would look something like this:

from bs4 import BeautifulSoup
import glob
import gc

def parser(file):
    gc.collect()

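    # Read the file inside a with block so the handle is closed right away.
    # (get_data is a str, so calling get_data.close() would raise AttributeError.)
    with open(file,'r') as f:
        get_data = f.read()

    soup = BeautifulSoup(get_data)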
    VerifyTable = "Clinical Results"

    tables = soup.findAll('table')

    for table in tables:
        First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
        if VerifyTable == First_Row_First_Column.strip():
            v1 = table.findAll('tr')[1].findAll('td')[0].text
            v2 = table.findAll('tr')[1].findAll('td')[1].text

            complete_row = v1.strip() + ";" + v2.strip()

            print (complete_row)

            # The with block closes the file, so no explicit close() is needed.
            with open("Results_File.txt","a") as out_file:
                out_file.write(complete_row + "\n")

    soup.decompose()
    gc.collect()
    return None


for filename in glob.glob("\\Research Results\\*"):
    parser(filename)

print ("done")
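If memory is still tight, another option worth testing is to avoid building a full tree at all. The sketch below swaps BeautifulSoup for the standard library's html.parser, which streams through the markup and keeps only the rows of the table currently being read. `ResultTableParser` and the sample HTML are hypothetical names and data made up for illustration, and the sketch assumes the tables are not nested inside one another.

```python
from html.parser import HTMLParser


class ResultTableParser(HTMLParser):
    """Hypothetical helper: collect cell text table by table, without a tree."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker   # text expected in the first cell of the wanted table
        self.rows = []         # rows of the table currently being read
        self.cell = None       # text pieces of the open <td>, or None outside a cell
        self.results = []      # "v1;v2" strings, one per matching table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.rows = []
        elif tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.cell is not None:
            self.rows[-1].append("".join(self.cell).strip())
            self.cell = None
        elif tag == "table":
            rows = self.rows
            # Keep the first two cells of the second row when the first cell matches.
            if len(rows) > 1 and rows[0] and rows[0][0] == self.marker and len(rows[1]) > 1:
                self.results.append(rows[1][0] + ";" + rows[1][1])
            self.rows = []


# Made-up sample input with one non-matching and one matching table.
sample = """
<table><tr><td>Other</td></tr><tr><td>x</td><td>y</td></tr></table>
<table><tr><td> Clinical Results </td></tr><tr><td> A1 </td><td> 5.2 </td></tr></table>
"""

table_parser = ResultTableParser("Clinical Results")
table_parser.feed(sample)
table_parser.close()
print(table_parser.results)
```

For the real files, the contents of each file would be passed to feed() with a fresh parser per file, and the collected results appended to Results_File.txt as before.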

BeautifulSoup MemoryError when opening multiple files in a directory
