Web scraping script stops if the internet connection drops

Time: 2017-07-13 13:34:53

Tags: python selenium web-scraping scrapy

I wrote a web scraping script using the Python Selenium WebDriver. It works fine, but the data set is very large, so when the internet connection drops for a while my script stops executing and I have to start again from the beginning. Please help me, or point me to some other solution, for example using the Scrapy framework.

Working code:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
import MySQLdb

# Open database connection
db = MySQLdb.connect(host="localhost", # your host, usually localhost
                     user="root", # your username
                     passwd="", # your password
                     db="test") # name of the data base

url = "http://xlnindia.gov.in/frm_G_Cold_S_Query.aspx"
browser = webdriver.Chrome()
browser.get(url)

time.sleep(5)  # give the page time to finish rendering
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

# the state dropdown lives in the <td> with bgcolor #ffe0c0; collect its option values
states = [ x["value"] for x in soup.find("td", bgcolor="#ffe0c0").find_all("option") ]

for state in states:

    cur_state = browser.find_element_by_id("ddldistrict")
    cur_state.send_keys(state)

    html = browser.page_source
    soup = BeautifulSoup(html, "lxml")

    # district options appear in the next <td>; the first entry is a placeholder, so drop it
    districts = [ x["value"] for x in soup.find("td", bgcolor="#ffe0c0").find_next_sibling().find_all("option") ]
    districts = districts[1:]

    for dist in districts:

        district = browser.find_element_by_id("ddldistrict")
        district.send_keys(dist)

        html = browser.page_source
        soup = BeautifulSoup(html, "lxml")

        talukas = [ x["value"] for x in soup.find("td", bgcolor="#ffe0c0").find_next_sibling().find_all("option") ]
        talukas = talukas[1:]

        for taluka in talukas:

            category = browser.find_element_by_id("ddltaluka")
            category.send_keys(taluka)

            button = browser.find_element_by_id("btnSearch")
            button.click()

            time.sleep(10)  # wait for the search results to load
            browser.save_screenshot(browser.title + ".JPEG")
            html = browser.page_source
            soup = BeautifulSoup(html, "html.parser")
            table = soup.find(id="dgDisplay")  # the results grid, present only when the search returned rows
            if table:
                #print("found")
                cursor=db.cursor()
                for row in table.findAll("tr")[1:]:
                    cells = row.findAll("td")
                    name = cells[0].find(text=True)
                    city = cells[1].find(text=True)
                    licence = cells[2].find(text=True)
                    owner = cells[3].find(text=True)
                    print("Name:"+name+"city:"+city+"licence:"+licence+"owner:"+owner)
                    try:
                       cursor.execute ("INSERT INTO distributors (name, city, licence, owner) VALUES (%s, %s, %s, %s);", (name, city, licence, owner))
                    except (MySQLdb.Error, MySQLdb.Warning) as e:
                        print(e)
                db.commit()        
            #else:  
            #    print("not found")
db.close()                
browser.close()
browser.quit()
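
One way to avoid starting over from scratch when the connection drops, which the script above does not yet do, is to checkpoint progress: record each state/district/taluka combination once its rows have been committed, and skip recorded combinations on the next run. The sketch below only illustrates the idea; the progress.txt file name and the load_done/mark_done helpers are hypothetical, not part of the original script.

import os

PROGRESS_FILE = "progress.txt"  # hypothetical checkpoint file: one "state|district|taluka" per line

def load_done():
    # read the combinations that were already scraped in earlier runs
    if not os.path.exists(PROGRESS_FILE):
        return set()
    with open(PROGRESS_FILE) as f:
        return set(line.strip() for line in f)

def mark_done(state, dist, taluka):
    # append a combination once its rows have been committed to the database
    with open(PROGRESS_FILE, "a") as f:
        f.write(state + "|" + dist + "|" + taluka + "\n")

done = load_done()
# In the innermost loop, before clicking Search:
#     if state + "|" + dist + "|" + taluka in done:
#         continue
# And right after db.commit():
#     mark_done(state, dist, taluka)

With something like this in place, re-running the script after a disconnect continues roughly where it left off instead of repeating every search.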

1 Answer:

Answer 0 (score: -2)

Normally, if no implicit timeout is set, the driver throws an exception immediately when it cannot find an element. With an implicit wait in place, slow page loads are also handled, because find_element keeps retrying until the timeout expires, even across a brief interruption. You can add the following timeout right after starting the browser and try it; it may help you.

driver.implicitly_wait(30)
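
Applied to the script in the question, where the driver object is named browser, a minimal sketch would be (the 30-second value is just the suggestion above and can be tuned):

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(30)  # every find_element call now retries for up to 30 seconds
browser.get("http://xlnindia.gov.in/frm_G_Cold_S_Query.aspx")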