Question

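So I've been working on a scraper that goes through 10k+ pages and scrapes data from them.

The issue is that memory consumption rises drastically over time. To work around this, instead of closing the driver instance only at the end of the scrape, I changed the scraper so that it closes the instance after each page is loaded and its data extracted.

But for some reason RAM still keeps filling up.

I tried using PhantomJS, but for some reason it doesn't load the data properly. With the initial version of the scraper I also tried limiting the Firefox cache to 100 MB, and that didn't work either.

Note: I run the tests with both chromedriver and Firefox instances, and unfortunately I can't use libraries such as requests, mechanize, etc. instead of selenium.

Any help is appreciated, since I've been trying to figure this out for a week now. Thanks.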

Answer 1

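Are you saying that your drivers are what's filling up your memory? How are you closing them? If you're extracting data, are you still holding references to some collection that keeps it in memory?

You mentioned that you were already running out of memory when you closed the driver instance at the end of the scrape, which makes it look like you're keeping extra references around.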
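To illustrate the point about lingering references, here is a minimal sketch that needs no browser. `HeavyPage`, `kept_objects`, and `kept_values` are hypothetical names invented for this example; `HeavyPage` merely stands in for any large per-page object. It assumes CPython, where an object is freed as soon as its last reference disappears.

```python
import gc
import weakref


class HeavyPage:
    """Hypothetical stand-in for a large per-page object (e.g. a parsed page)."""

    def __init__(self, text):
        self.text = text
        self.blob = bytearray(1_000_000)  # simulate ~1 MB of page data


kept_objects = []  # anti-pattern: every page object stays reachable
kept_values = []   # better: only the small extracted strings are stored


def scrape_keeping_object(text):
    page = HeavyPage(text)
    kept_objects.append(page)       # the list keeps the whole page alive
    return weakref.ref(page)        # weak reference, used only to observe lifetime


def scrape_keeping_value(text):
    page = HeavyPage(text)
    kept_values.append(page.text)   # copy out just the data that is needed
    return weakref.ref(page)        # `page` itself becomes unreachable on return


leaky = scrape_keeping_object("title A")
clean = scrape_keeping_value("title B")
gc.collect()

print(leaky() is not None)  # True: still referenced from kept_objects
print(clean() is None)      # True on CPython: the page object was freed
```

If a scraper appends driver objects, elements, or whole page sources to a long-lived list, it behaves like `scrape_keeping_object`, and closing the browser alone will not shrink the Python process.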

Answer 2

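The only way to force the Python interpreter to release memory back to the OS is to terminate the process. Therefore, use multiprocessing to spawn the selenium Firefox instance; when the spawned process terminates, the memory is freed: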

import multiprocessing as mp
import selenium.webdriver as webdriver

def worker():
    driver = webdriver.Firefox()
    # do memory-intensive work
    # closing and quitting is not what ultimately frees the memory, but it
    # is good to close the WebDriver session gracefully anyway.
    driver.close()
    driver.quit()

if __name__ == '__main__':
    p = mp.Process(target=worker)
    # run `worker` in a subprocess
    p.start()
    # make the main process wait for `worker` to end
    p.join()
    # all memory used by the subprocess will be freed to the OS
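For a scrape over many pages, the single-subprocess idea above could be extended with a pool whose workers are replaced after every task. This is only a sketch under assumptions: `fetch_title`, `fake_fetch`, and `run_isolated` are hypothetical helpers, not part of the original answer, and `fake_fetch` exists solely so the pattern can be tried without a browser.

```python
import multiprocessing as mp
import os


def fetch_title(url):
    """Load one page in its own browser and return plain data only."""
    from selenium import webdriver  # imported lazily, inside the child process

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return url, driver.title
    finally:
        driver.quit()


def fake_fetch(url):
    """Browser-free stand-in: reports which process handled the URL."""
    return url, os.getpid()


def run_isolated(task, urls):
    # maxtasksperchild=1 makes each worker exit after a single task, so the
    # pool starts a fresh process (with fresh memory) for the next URL.
    # chunksize=1 ensures one URL counts as one task.
    with mp.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.map(task, urls, chunksize=1)


if __name__ == "__main__":
    for url, pid in run_isolated(fake_fetch, ["page-1", "page-2", "page-3"]):
        print(url, "handled by process", pid)
```

Swapping `fake_fetch` for `fetch_title` would run each page load in a process that exits afterwards. Starting a browser per page is slow, so a larger `maxtasksperchild` may be a reasonable trade-off between speed and memory.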

See also: Why doesn't Python release the memory when I delete a large object?

Answer 3

I ran into a similar issue, and destroying the driver myself (i.e. setting the driver to None) prevented these memory leaks for me.

Answer 4

I had the same problem until I put the driver's get(url) call inside a try / except / finally statement and made sure quit() was in the finally clause, so that it always executes. Like this:

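from selenium import webdriver

# `url` is assumed to be defined by the surrounding code.
# The instance is named `driver` so it does not shadow the `webdriver` module.
driver = webdriver.Firefox()
try:
    driver.get(url)
    source_body = driver.page_source
except Exception as e:
    print(e)
finally:
    # Runs whether or not get() raised, so the browser is always shut down.
    driver.quit()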

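From the docs:

The finally clause of such a statement can be used to specify cleanup code which does not handle the exception, but is executed whether an exception occurred or not in the preceding code.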
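When many pages are scraped, the try / finally pattern above can be wrapped once in a context manager so the cleanup cannot be forgotten. The following is a sketch, not the answerer's code: `managed_driver` and `FakeDriver` are hypothetical names, and `FakeDriver` only imitates the `quit()` method so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory` and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # executed on normal exit and when an exception propagates


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; records whether quit() ran."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True


# Normal exit: quit() is called when the with-block ends.
with managed_driver(FakeDriver) as ok_driver:
    pass

# Failing page load (simulated): quit() is still called before the error surfaces.
try:
    with managed_driver(FakeDriver) as failed_driver:
        raise RuntimeError("simulated page load failure")
except RuntimeError as exc:
    print("caught:", exc)

print(ok_driver.quit_called, failed_driver.quit_called)
```

With real Selenium, passing `webdriver.Firefox` as `factory` should behave the same way, since the only thing the helper relies on is that the returned object has a `quit()` method.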

Selenium does not release memory even after calling close / quit
