Access denied when scraping a website with Selenium in Python

Asked: 2018-05-21 18:46:40

Tags: python selenium web-scraping access-denied data-extraction

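Hello, I am trying to extract information from Macy's website, specifically from this category: 'https://www.macys.com/shop/featured/women-handbags'. But when I open an individual item page, I get a blank page with the following message: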

Access Denied You don't have permission to access "any of the item links listed on the category link above" on this server. Reference #18.14d6f7bd.1526927300.12232a22

I also tried changing the user agent through the Chrome options, but it didn't work.

Here is my code:

import sys
reload(sys)
sys.setdefaultencoding('utf8')
from selenium import webdriver 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

url = 'https://www.macys.com/shop/featured/women-handbags'

def init_selenium():
    global driver
    driver = webdriver.Chrome("/Users/rodrigopeniche/Downloads/chromedriver")
    driver.get(url)

def find_page_items():
    items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
    for index, element in enumerate(items_elements):
        # Re-query the list on every pass: the old elements go stale after navigating away.
        items_elements = driver.find_elements_by_css_selector('li.productThumbnailItem')
        item_link = items_elements[index].find_element_by_tag_name('a').get_attribute('href')
        driver.get(item_link)  # this is the request that returns "Access Denied"
        driver.back()


init_selenium()
find_page_items()

Any idea what is going on and what I can do to fix it?

1 Answer:

Answer 0: (score: 0)

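This is not a Selenium-only solution (Selenium collects the item links, and requests with BeautifulSoup fetches each item page), but it works. You can give it a try.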

from selenium import webdriver 
import requests
from bs4 import BeautifulSoup

url = 'https://www.macys.com/shop/featured/women-handbags'

def find_page_items(driver,link):
    driver.get(link)
    item_link = [item.find_element_by_tag_name('a').get_attribute('href') for item in driver.find_elements_by_css_selector('li.productThumbnailItem')]
    for newlink in item_link:
        res = requests.get(newlink,headers={"User-Agent":"Mozilla/5.0"})
        soup = BeautifulSoup(res.text,"lxml")
        name = soup.select_one("h1[itemprop='name']").text.strip()
        print(name)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        find_page_items(driver,url)
    finally:
        driver.quit()

Output:

Mercer Medium Bonded-Leather Crossbody
Mercer Large Tote
Nolita Medium Satchel
Voyager Medium Multifunction Top-Zip Tote
Mercer Medium Crossbody
Kelsey Large Crossbody
Medium Mercer Gallery
Mercer Large Center Tote
Signature Raven Large Tote

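However, if you insist on using Selenium alone, you need to create a new driver instance every time you navigate to a new URL, or, as a better option, clear the cache.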
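
For illustration, here is a minimal sketch of those two Selenium-only options. The helper names (visit_with_fresh_driver, visit_with_cleared_cookies, make_chrome) and the handle_page callback are hypothetical and not part of the code above. Selenium has no single "clear cache" call that I am sure of, so the second helper substitutes driver.delete_all_cookies(), which removes cookies only. Neither variant has been checked against the site, so treat this as a starting point rather than a confirmed fix.

```python
def visit_with_fresh_driver(links, make_driver, handle_page):
    """Open every link in its own browser instance and collect the results."""
    results = []
    for link in links:
        driver = make_driver()  # new browser, so no state carries over from the last page
        try:
            driver.get(link)
            results.append(handle_page(driver))
        finally:
            driver.quit()  # always close the browser, even if the page fails
    return results


def visit_with_cleared_cookies(driver, links, handle_page):
    """Reuse one browser, but delete its cookies before each page load."""
    results = []
    for link in links:
        # Assumption: cookies stand in for "cache" here; this call does not
        # touch the HTTP cache and only affects the currently loaded site.
        driver.delete_all_cookies()
        driver.get(link)
        results.append(handle_page(driver))
    return results


def make_chrome():
    # Imported lazily so the helpers above do not require Selenium until a browser is needed.
    from selenium import webdriver
    return webdriver.Chrome()
```

With item_link collected as in find_page_items above, a call could look like visit_with_fresh_driver(item_link, make_chrome, lambda d: d.title). Starting a browser per page is slow, which is presumably why the answer calls clearing state the better option.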