Question

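I'm currently scraping a real estate website that uses JavaScript. My process starts by scraping a results page containing many different href links to individual listings, appends those links to another list, and then clicks the next button. I repeat this until the next button is no longer clickable.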

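My problem is that after all the listings have been collected (~13,000 links), the scraper doesn't move on to the second part, where it opens each link and gets the information I need. Selenium doesn't even open the first element of the list of links.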

Here's my code:

for links in houselinklist:
    print(links)
    newwebpage = links
    driver.get(newwebpage)
    html = driver.page_source
    soup = bs.BeautifulSoup(html,'html.parser')
    .
    .
    .
    . more code here

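After this, I have another simple scraper that goes through the list of listings, opens each one in Selenium, and collects the data for that listing.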

file = drive.CreateFile({'title': pagename, 
"parents":  [{"id": folder_id}], 
"mimeType": "application/vnd.google-apps.document"})

file.SetContentString("Hello World")

file.Upload()

Answer 1

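The problem is that while True: creates a loop that runs forever. Your except clause contains a pass statement, which means that when an error occurs the loop simply keeps running and never exits. Instead, it can be written as: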

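# Imports assumed from the names used below; driver and houselinklist
# are expected to exist already, as in the question.
import bs4 as bs
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
while True:
    try:
        # Raises after 10 seconds if no clickable 'next' link shows up
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except Exception:
        break  # exit the loop instead of passing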

As soon as an error occurs, the loop will break and execution moves on to the next line of code after it.
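To see the control flow in isolation, here is a minimal sketch that runs without a browser. FakePager and NextButtonTimeout are hypothetical stand-ins for the Selenium driver and for the timeout raised when no clickable 'next' link appears; neither is part of Selenium. The sketch reads the current page before waiting for the button, on the assumption that the last results page has no clickable 'next' link and would otherwise be skipped.

```python
class NextButtonTimeout(Exception):
    """Hypothetical stand-in for the timeout raised by wait.until(...)."""


class FakePager:
    """Hypothetical stand-in for the browser: serves a fixed list of pages."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def current_links(self):
        # Links found on the page that is currently "open"
        return self.pages[self.index]

    def wait_for_next(self):
        # Fail on the last page, where no 'next' link is clickable
        if self.index >= len(self.pages) - 1:
            raise NextButtonTimeout("no clickable 'next' link")

    def click_next(self):
        self.index += 1


def collect_links(pager):
    """Gather links from every page, leaving the loop on the first timeout."""
    links = []
    while True:
        links.extend(pager.current_links())
        try:
            pager.wait_for_next()
        except NextButtonTimeout:
            break  # with `pass` here the loop would never terminate
        pager.click_next()
    return links


if __name__ == "__main__":
    pager = FakePager([["a", "b"], ["c"], ["d", "e"]])
    print(collect_links(pager))
```

Catching one specific exception type, rather than everything, also keeps unrelated bugs such as a missing table from silently ending the pagination.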

Alternatively, you can drop the while loop altogether and use a for loop to iterate over your list of href links:

wait = WebDriverWait(driver, 10)
hrefLinks = ['link1', 'link2', 'link3']  # ... and so on
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except Exception:
        pass  # ignore the error and move on to the next href link
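The per-link try/except can be sketched without a browser as well. scrape_all and fake_fetch are hypothetical names; fetch stands in for whatever driver.get plus parsing does for a single listing. Instead of a silent pass, this version records which links failed, which may help when only some of the ~13,000 pages misbehave.

```python
def scrape_all(links, fetch):
    """Call fetch(link) for every link and keep going when one of them fails."""
    results = {}
    failed = []
    for link in links:
        try:
            results[link] = fetch(link)
        except Exception as exc:
            # Remember the failure, then continue with the next link
            failed.append((link, str(exc)))
    return results, failed


def fake_fetch(link):
    """Hypothetical fetcher: fails for any link containing 'bad'."""
    if "bad" in link:
        raise ValueError("could not load " + link)
    return {"url": link}


if __name__ == "__main__":
    results, failed = scrape_all(["link1", "bad-link", "link3"], fake_fetch)
    print(len(results), "scraped,", len(failed), "failed")
```

Returning the failures alongside the results makes it possible to retry just those links later.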

Webscraping using selenium, beautifulsoup and python
