Memory leak when parsing HTML page source with BeautifulSoup and Requests

Date: 2018-08-17 11:51:04

Tags: python memory-leaks beautifulsoup python-requests

So the basic idea is to make GET requests to a list of URLs and parse the text out of those page sources, stripping the HTML tags and scripts with BeautifulSoup. Python version 2.7.

The problem is that the parser function keeps accumulating memory on every request, so the process's memory footprint grows steadily.

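For illustration, a minimal sketch of the kind of fetch loop described (the URL list is hypothetical, and it assumes the response body is passed to the parser directly, per the commented-out lxml line in the function below):

import requests

urls = ['http://example.com/a', 'http://example.com/b']  # hypothetical URL list

for url in urls:
    response = requests.get(url)
    # memory grows a little more on every iteration
    text = get_text_from_page_source(response.content)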

The leak can even be reproduced with a local text file, which makes it convenient to profile. For example:

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source),'html.parser')
#     soup = BeautifulSoup(page_source,"lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text
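
One rough way to watch the growth (a sketch, assuming a Unix-like system where resource.getrusage is available, and a hypothetical saved page local_page.html):

import resource

def peak_rss_kb():
    # peak resident set size of this process (KB on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i in range(100):
    get_text_from_page_source('local_page.html')  # hypothetical local file
    print("iteration %d: peak RSS %d" % (i, peak_rss_kb()))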


2 answers:

Answer 0 (score: 2)

You can try invoking the garbage collector:

import gc

response.close()   # explicitly close the response
response = None    # drop the reference so the object can be collected
gc.collect()       # force a garbage collection pass
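
Inside the fetch loop, that would look something like this (a sketch; the loop and URL list are assumptions, and it assumes the parser takes the response body directly):

import gc
import requests

urls = ['http://example.com/a', 'http://example.com/b']  # hypothetical URL list

for url in urls:
    response = requests.get(url)
    text = get_text_from_page_source(response.content)
    response.close()   # close the response explicitly
    response = None    # clear the reference
    gc.collect()       # collect between iterations so garbage cannot pile up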

This may also help you: Python high memory usage with BeautifulSoup

Answer 1 (score: 0)

You can try calling soup.decompose at the end of get_text_from_page_source to destroy the tree before the function returns (see the combined sketch at the end of this answer).

Also, if you are opening a text file rather than feeding in the request content directly, as shown here:

soup = BeautifulSoup(open(page_source),'html.parser')

remember to close it when you're done. To keep it short, you can change that line to:

with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(),'html.parser')
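
Putting both suggestions together (closing the file and decomposing the tree), a sketch of the revised function could look like this:

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    with open(page_source, 'r') as html_file:  # file handle is closed on exit
        soup = BeautifulSoup(html_file.read(), 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    # get the visible text, normalize whitespace, drop blank lines
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the parse tree before returning
    return text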