So the basic idea is to make GET requests to a list of URLs and extract the text from those page sources, using BeautifulSoup to strip out the HTML tags and scripts. Python version 2.7.
The problem is that with every request the parser function keeps accumulating memory: the process size grows steadily.
The memory leak can be reproduced and analyzed even with local text files. For example:
from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source), 'html.parser')
    # soup = BeautifulSoup(page_source, "lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()  # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    # print text
    return text
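To observe the growth, a minimal loop such as the following can be used (a sketch for illustration only: the file names are hypothetical, and the standard-library resource module is Unix-only, with ru_maxrss reported in kilobytes on Linux):

import resource  # Unix-only; reports resource usage of the current process

# hypothetical list of locally saved page sources
pages = ['page1.html', 'page2.html', 'page3.html']

for path in pages:
    text = get_text_from_page_source(path)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('parsed %s, peak memory so far: %d kB' % (path, peak_kb))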
Answer 0 (score: 2)

You can try calling the garbage collector:
import gc

response.close()  # response here is the HTTP response object from the request
response = None
gc.collect()
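In the scraping loop, that advice would look roughly like the sketch below. The use of requests and the urls list are assumptions (the question does not show the download code), and get_text_from_page_source is assumed to be adapted to accept the raw HTML string rather than a file path, as the commented-out lxml line suggests it once did:

import gc
import requests  # assumed HTTP client; not shown in the question

for url in urls:  # urls: hypothetical list of pages to scrape
    response = requests.get(url)
    text = get_text_from_page_source(response.content)  # assumes a string-accepting variant
    response.close()  # release the underlying connection
    response = None   # drop the last reference to the response object
    gc.collect()      # force a collection pass between requests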
Answer 1 (score: 0)

You can try calling soup.decompose at the end of the get_text_from_page_source function, before it returns, so that the parse tree is destroyed.
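A minimal sketch of that change, showing only the tail of the function:

    # ... end of get_text_from_page_source ...
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the parse tree and free its nodes before returning
    return text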
If you are opening a text file rather than feeding the request content in directly, as here:
soup = BeautifulSoup(open(page_source),'html.parser')
remember to close it once you are done. To keep it short, you can change that line to:
with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(), 'html.parser')
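The with block guarantees the file handle is closed even if parsing raises an exception, which also keeps file descriptors from leaking across repeated calls.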