Question

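I wrote a Python script that uses Selenium to save data from a specific page of a website. The script basically opens the website, enters an ID into a form, and downloads the result. This is the basic code: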

import os
from database import id         ## This array contains 30000 different IDs
from scrapper import Scrapper   ## This class does the magic

## WEBSITE_URI and HEADLESS are defined elsewhere (not shown here)

start = int(os.environ["START"])   ## Starting ID index, eg: 4500
end = int(os.environ["END"])       ## Ending ID index, eg: 6000

def main(id):
    drive.open(WEBSITE_URI)     ## Open the website
    drive.insert(id)            ## Insert id into the form fields
    a, b, c = drive.download()  ## Download the data into a, b and c
    return a, b, c

if __name__ == "__main__":
    drive = Scrapper(HEADLESS)  ## Start Firefox in headless mode

    try:
        i = start
        while i <= end:
            x, y, z = main(id[i])
            print(x, y, z)
            i += 1

    finally:
        drive.browser.quit()    ## quit() closes every window and ends the session

Each loop cycle takes about one minute (roughly 1,500 cycles in 24 hours), so going through all 30,000 IDs would take about 20 days!
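As a quick check of that estimate, the arithmetic can be sketched as below. The one-minute-per-ID figure and the 30,000 total come from the description above; the helper name `days_needed` is hypothetical, and the sketch assumes the work splits evenly across scripts with no slowdown from running them side by side.

```python
import math

TOTAL_IDS = 30_000          # size of the ID list, from the question
MINUTES_PER_ID = 1          # approximate time per loop cycle, from the question
MINUTES_PER_DAY = 24 * 60   # 1440, i.e. roughly 1500 cycles per day


def days_needed(total_ids, minutes_per_id, workers=1):
    """Wall-clock days to process every ID when the IDs are split
    evenly across `workers` scripts running in parallel (assumption)."""
    ids_per_worker = math.ceil(total_ids / workers)
    return ids_per_worker * minutes_per_id / MINUTES_PER_DAY


if __name__ == "__main__":
    for workers in (1, 10, 20):
        days = days_needed(TOTAL_IDS, MINUTES_PER_ID, workers)
        print(f"{workers:>2} script(s): {days:.1f} days")
```

Under these assumptions a single script needs about 20.8 days, 10 scripts about 2.1 days, and 20 scripts about one day.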

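So my workaround is to run this script several times at once, changing the os.environ variables before each launch so that each run handles a different range of IDs.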
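The manual step of picking a START and END for each run could be automated with a small launcher, roughly as sketched here. The names `split_range` and `launch_all` are hypothetical, and the script path passed to `launch_all` is whatever file holds the scraping code. END is treated as inclusive, matching the `<=` comparison in the loop above.

```python
import os
import subprocess
import sys


def split_range(first, last, parts):
    """Split the inclusive range [first, last] into up to `parts`
    contiguous, nearly equal inclusive (start, end) pairs."""
    total = last - first + 1
    base, extra = divmod(total, parts)
    ranges = []
    start = first
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        if size == 0:
            break  # more parts than items; stop early
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def launch_all(script, first, last, parts):
    """Start one copy of `script` per chunk, each with its own
    START/END environment variables. Returns the Popen objects."""
    procs = []
    for start, end in split_range(first, last, parts):
        env = dict(os.environ, START=str(start), END=str(end))
        procs.append(subprocess.Popen([sys.executable, script], env=env))
    return procs


if __name__ == "__main__":
    # Preview only: print the chunks without starting any process.
    for start, end in split_range(0, 29_999, 20):
        print(f"START={start} END={end}")
```

This only removes the manual bookkeeping. Each launched process still starts its own Firefox, so the per-browser cost stays the same.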

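The problem is that each running script also starts its own Firefox (4 processes: one main process and 3 child processes), which consumes about 1 GB of RAM and 10% of the CPU. This limits me to at most 10 scripts in parallel (I need to run at least 20 in parallel so that I can download all the data once a day).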

Is it possible to process all the IDs from a single script and so eliminate this overhead?

Thanks!

A better option for running multiple Python / Selenium scripts (dealing with overhead)

0 answers: