Scraping data with Selenium while iterating over a website

Time: 2020-11-03 08:45:53

Tags: selenium scrapy web-crawler

Hi, I am trying to scrape one particular number from the site https://web.archive.org/web/*/https://cd.lianjia.com/

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
import requests
from bs4 import BeautifulSoup
import re

# Request headers used when fetching each archived day page with requests.
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

path = r'C:\Windows\chromedriver.exe'  # raw string so the backslashes are not treated as escape sequences
driver = webdriver.Chrome(path)
driver.implicitly_wait(10)
driver.get('https://web.archive.org/web/*/https://cd.lianjia.com/')

data = []
for i in range(19, 26):
    # Click the i-th month label in the Wayback Machine calendar header.
    element = driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div[2]/span[%d]' % i)
    actions = ActionChains(driver)
    actions.move_to_element(element).click().perform()
    # Collect the snapshot links for every day shown in that month.
    days = driver.find_elements_by_css_selector("div.calendar-day > a")
    for n in days:
        url = n.get_attribute('href')
        req = requests.get(url, headers=headers)  # pass the dict with headers=, not positionally (positionally it becomes query params)
        soup = BeautifulSoup(req.content, 'html.parser')
        target = soup.findAll("div", {"class": "house-num"})
        # Take the first run of digits found inside the house-num div(s).
        se_house = int(re.findall(r'[0-9]+', str(target))[0])
        data.append(se_house)

So here I try to go into each specific date link and grab one particular number from it, shown in the red circle of the screenshot. I use BeautifulSoup to look the content up by class = house-num (screenshot). If I print target it looks like this (screenshot), so I pull the number out directly with se_house = int(re.findall(r'[0-9]+', str(target))[0]). Then I run this code to get that number for every date, but in the end every entry in my data list is 134591, for example (screenshot). Can someone suggest where I am going wrong?
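For reference, a minimal standalone sketch of the per-day extraction step, using a single hypothetical snapshot URL (the timestamp in the URL below is made up purely for illustration). Note that requests.get needs the header dict passed via the headers= keyword; passed positionally it is sent as query parameters instead. Printing the fetched URL next to the extracted number also makes it easy to see whether each iteration is really hitting a different page:

import re
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

# Hypothetical archived snapshot URL, for illustration only.
url = 'https://web.archive.org/web/20201101000000/https://cd.lianjia.com/'

req = requests.get(url, headers=headers)           # headers= keyword, not positional
soup = BeautifulSoup(req.content, 'html.parser')
target = soup.find("div", {"class": "house-num"})  # first div with class "house-num", or None
if target is not None:
    match = re.search(r'[0-9]+', target.get_text())
    if match:
        print(url, int(match.group()))             # show which page produced which number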

0 Answers:

No answers