Memory leak when parsing HTML page source with BeautifulSoup and Requests

Date: 2018-08-17 11:51:04

Tags: python memory-leaks beautifulsoup python-requests

So the basic idea is to make GET requests to a list of URLs and parse the text out of those page sources, stripping the HTML tags and scripts with BeautifulSoup. Python version 2.7.

The problem is that the parser function keeps accumulating memory on every request, so the process's memory footprint grows steadily.

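For illustration, a minimal sketch of the kind of fetch loop described (the URL list is hypothetical, and it assumes the response body is passed to the parser directly, per the commented-out lxml line in the function below):

import requests

urls = ['http://example.com/a', 'http://example.com/b']  # hypothetical URL list

for url in urls:
    response = requests.get(url)
    # memory grows a little more on every iteration
    text = get_text_from_page_source(response.content)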

The leak can even be reproduced with a local text file, which makes it convenient to profile. For example:

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source),'html.parser')
#     soup = BeautifulSoup(page_source,"lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text
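
One rough way to watch the growth (a sketch, assuming a Unix-like system where resource.getrusage is available, and a hypothetical saved page local_page.html):

import resource

def peak_rss_kb():
    # peak resident set size of this process (KB on Linux, bytes on macOS)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i in range(100):
    get_text_from_page_source('local_page.html')  # hypothetical local file
    print("iteration %d: peak RSS %d" % (i, peak_rss_kb()))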


2 answers:

Answer 0 (score: 2)

You can try invoking the garbage collector:

import gc

response.close()   # explicitly close the response
response = None    # drop the reference so the object can be collected
gc.collect()       # force a garbage collection pass
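
Inside the fetch loop, that would look something like this (a sketch; the loop and URL list are assumptions, and it assumes the parser takes the response body directly):

import gc
import requests

urls = ['http://example.com/a', 'http://example.com/b']  # hypothetical URL list

for url in urls:
    response = requests.get(url)
    text = get_text_from_page_source(response.content)
    response.close()   # close the response explicitly
    response = None    # clear the reference
    gc.collect()       # collect between iterations so garbage cannot pile up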

This may also help you: Python high memory usage with BeautifulSoup

Answer 1 (score: 0)

You can try calling soup.decompose at the end of get_text_from_page_source to destroy the tree before the function returns (see the combined sketch at the end of this answer).

Also, if you are opening a text file rather than feeding in the request content directly, as shown here:

soup = BeautifulSoup(open(page_source),'html.parser')

remember to close it when you're done. To keep it short, you can change that line to:

with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(),'html.parser')
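
Putting both suggestions together (closing the file and decomposing the tree), a sketch of the revised function could look like this:

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    with open(page_source, 'r') as html_file:  # file handle is closed on exit
        soup = BeautifulSoup(html_file.read(), 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    # get the visible text, normalize whitespace, drop blank lines
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the parse tree before returning
    return text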