Scraping data with Selenium while iterating over a website

Time: 2020-11-03 08:45:53

Tags: selenium scrapy web-crawler

Hi, I am trying to scrape one particular number from the site https://web.archive.org/web/*/https://cd.lianjia.com/

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
import requests
from bs4 import BeautifulSoup
import re

# Request headers used when fetching each archived day page with requests.
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

path = r'C:\Windows\chromedriver.exe'  # raw string so the backslashes are not treated as escape sequences
driver = webdriver.Chrome(path)
driver.implicitly_wait(10)
driver.get('https://web.archive.org/web/*/https://cd.lianjia.com/')

data = []
for i in range(19, 26):
    # Click the i-th month label in the Wayback Machine calendar header.
    element = driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div[2]/span[%d]' % i)
    actions = ActionChains(driver)
    actions.move_to_element(element).click().perform()
    # Collect the snapshot links for every day shown in that month.
    days = driver.find_elements_by_css_selector("div.calendar-day > a")
    for n in days:
        url = n.get_attribute('href')
        req = requests.get(url, headers=headers)  # pass the dict with headers=, not positionally (positionally it becomes query params)
        soup = BeautifulSoup(req.content, 'html.parser')
        target = soup.findAll("div", {"class": "house-num"})
        # Take the first run of digits found inside the house-num div(s).
        se_house = int(re.findall(r'[0-9]+', str(target))[0])
        data.append(se_house)

So here I try to go into each specific date link and grab one particular number from it, shown in the red circle of the screenshot. I use BeautifulSoup to look the content up by class = house-num (screenshot). If I print target it looks like this (screenshot), so I pull the number out directly with se_house = int(re.findall(r'[0-9]+', str(target))[0]). Then I run this code to get that number for every date, but in the end every entry in my data list is 134591, for example (screenshot). Can someone suggest where I am going wrong?
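For reference, a minimal standalone sketch of the per-day extraction step, using a single hypothetical snapshot URL (the timestamp in the URL below is made up purely for illustration). Note that requests.get needs the header dict passed via the headers= keyword; passed positionally it is sent as query parameters instead. Printing the fetched URL next to the extracted number also makes it easy to see whether each iteration is really hitting a different page:

import re
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

# Hypothetical archived snapshot URL, for illustration only.
url = 'https://web.archive.org/web/20201101000000/https://cd.lianjia.com/'

req = requests.get(url, headers=headers)           # headers= keyword, not positional
soup = BeautifulSoup(req.content, 'html.parser')
target = soup.find("div", {"class": "house-num"})  # first div with class "house-num", or None
if target is not None:
    match = re.search(r'[0-9]+', target.get_text())
    if match:
        print(url, int(match.group()))             # show which page produced which number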

0 Answers:

No answers