How do I scrape all the web pages with Selenium?

Asked: 2020-04-18 14:28:02

Tags: selenium-webdriver web-scraping beautifulsoup

I have written the code below. I want to get all the data from every page and store it in a CSV file, but I don't know what to do next. I can do this with BeautifulSoup alone, but not with Selenium and BeautifulSoup combined.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup as bs
import pandas

chrome_option = Options()
chrome_option.add_argument("--headless")


browser = webdriver.Chrome(executable_path="D:/chromedriver.exe", options=chrome_option)

# list of data
names = []
license_num = []
types = []
contacts = []
links = []



def scrape(url):
    browser.get(url)
    sleep(5)
    html = browser.execute_script("return document.documentElement.outerHTML")

    sel_soup = bs(html, "html.parser")

    containers = sel_soup.find_all(class_ = "d-agent-card container_13WXz card_2AIgF")


    for cont in containers:
        # agent name
        agent = cont.find("h4").text.strip()
        names.append(agent)

        # agent type

        tp = cont.find("h5").text.strip()
        types.append(tp)

        # agent contact

        contact = cont.find("button", {"class": "button_3pYtF icon-left_1xpTg secondary_KT7Sy link_-iSRx contactLink_BgG5h"})
        if contact is not None:
            contacts.append(contact.text.strip())
        else:
            contacts.append("None")

        # agent link

        # guard both lookups: the links container itself may be missing on some cards
        link_div = cont.find("div", {"class": "linksContainer_1-v7q"})
        link = link_div.find("a") if link_div is not None else None

        if link is not None:
            links.append(link["href"])
        else:
            links.append("None")

        # license 

        licns = cont.find("p", {"class": "license_33m8Z"})
        license_num.append(licns.text.strip() if licns is not None else "None")

for page in range(1, 27):
    url = f"https://www.remax.com/real-estate-agents/Dallas-TX?page={page}"
    scrape(url)


df = pandas.DataFrame({
    "Agent Name": names,
    "Agent Type" : types,
    "Agent License Number": license_num,
    "Agent contact Number": contacts,
    "Agent URL": links
    })
df.to_csv("data.csv", index=False)   

This way I only get 596 rows of data, but I want to get 24×25 + 19 = 619 rows. Each page has 24 rows of data and I want all of them, but maybe I am only getting data from 23 pages. Now I am also getting this error...
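One possible cause of silently losing rows is that `cont.find(...)` returns `None` whenever a card lacks a field, and an unguarded `.text` then raises `AttributeError` and aborts the rest of that page. A minimal sketch of a per-card parser that never raises, reusing the class names from the question (the HTML snippet below is a made-up card for illustration, not real remax.com markup):

```python
from bs4 import BeautifulSoup

def parse_card(cont):
    """Extract one agent card into a dict; any missing field becomes "None"."""
    def text_of(tag):
        return tag.text.strip() if tag is not None else "None"

    link_div = cont.find("div", {"class": "linksContainer_1-v7q"})
    link = link_div.find("a") if link_div is not None else None
    return {
        "Agent Name": text_of(cont.find("h4")),
        "Agent Type": text_of(cont.find("h5")),
        "Agent License Number": text_of(cont.find("p", {"class": "license_33m8Z"})),
        # bs4 matches a single class token even when the element has several
        "Agent contact Number": text_of(cont.find("button", {"class": "contactLink_BgG5h"})),
        "Agent URL": link["href"] if link is not None else "None",
    }

# Hypothetical card with no contact button and no links container
html = """
<div class="d-agent-card container_13WXz card_2AIgF">
  <h4>Jane Doe</h4><h5>Agent</h5>
  <p class="license_33m8Z">TX 0123456</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
row = parse_card(soup.find(class_="d-agent-card"))
```

Collecting one dict per card like this (and building the DataFrame from the list of dicts at the end) also keeps the five columns aligned even when fields are missing.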

{"QTM: JSEvent: TypeError: Cannot read property 'data' of undefined !!window.QuantumMetricAPI.lastXHR.data && !!JSON.parse(QuantumMetricAPI.lastXHR.data).term"}

0 Answers