I have this recursive function: it finds all href links on a page, then for each link it finds that page's links (recursively):
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def getLinks(pageUrl):
    try:
        html = requests.get(pageUrl, verify=False)
    except Exception as e:
        print(e)
        return
    bsObj = BeautifulSoup(html.text, "html.parser", from_encoding="iso-8859-1")
    for link in bsObj.find_all("a", href=re.compile("^(/)")):
        page = link.get('href')
        pages.add(page)
        getLinks(page)

getLinks("")
The problem is that it burns through RAM very quickly.
How can I fix this memory consumption?
Answer 0 (score: 0)
A few things that may help:
visited_links = set()            # a set makes the membership checks O(1)
unvisited_links = [pageUrl]
while len(unvisited_links) > 0:
    link = unvisited_links.pop()
    if link not in visited_links:
        visited_links.add(link)
        for new_link in links_on_page(link):  # pseudocode: extract the hrefs from the page at link
            if new_link not in visited_links:
                unvisited_links.append(new_link)
# At this point, visited_links holds your list of links
This not only avoids the overhead introduced by recursion (each recursive call keeps its own stack frame alive), it also avoids revisiting links, which is far slower than a membership check and certainly wastes more memory.
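A runnable sketch of the iterative approach above, assuming the page-fetching step is injected as a function (`get_links` is a hypothetical callback, not part of the original code); this lets the traversal be exercised without network access, and a requests/BeautifulSoup fetcher can be swapped in for real crawling:

```python
def crawl(start_url, get_links):
    """Return the set of all links reachable from start_url.

    get_links(url) must return an iterable of the links found on that page.
    """
    visited_links = set()
    unvisited_links = [start_url]
    while unvisited_links:
        link = unvisited_links.pop()
        if link not in visited_links:
            visited_links.add(link)
            for new_link in get_links(link):
                if new_link not in visited_links:
                    unvisited_links.append(new_link)
    return visited_links

# Example with a fake site graph (no network needed), including a cycle
# between "/" and "/a" that would send the recursive version into a loop:
site = {"/": ["/a", "/b"], "/a": ["/b", "/"], "/b": []}
print(sorted(crawl("/", site.get)))  # → ['/', '/a', '/b']
```

Note that the cycle `/ → /a → /` is handled naturally: `/` is already in `visited_links` when it is popped the second time, so it is simply skipped instead of re-fetched.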