Python & BeautifulSoup 4 / Selenium - can't get data from kicksusa.com?

Asked: 2019-03-10 16:18:17

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

I am trying to scrape data from kicksusa.com and have run into some problems.

When I try a basic BS4 approach like this (the imports are copy/pasted from the main program, which uses all of them):

import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup

data1 = requests.get('https://www.kicksusa.com/')
soup1 = BeautifulSoup(data1.text, 'html.parser')

button = soup1.find('span', attrs={'class': 'shop-btn'})  # returns None here
print(button)

The result is None, which tells me the data is hidden behind JS. So I tried Selenium, like this:

options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get('https://www.kicksusa.com/') 
url = driver.find_element_by_xpath("//span[@class='shop-btn']").text
print(url)
driver.close()

I get "Unable to locate element".

Does anyone know how to scrape this site with BS4 or Selenium? Thanks in advance!

2 Answers:

Answer 0 (score: 1)

The problem is that you are being detected as a bot and get a response like this:

<html style="height:100%">
    <head>
        <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
        <meta name="format-detection" content="telephone=no">
        <meta name="viewport" content="initial-scale=1.0">
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        <script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
    </head>
    <body style="margin:0px;height:100%">
    <iframe src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=5-36224256-0%200NNN%20RT%281552245394179%20277%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B15%2811%2c110765%2c0%29%20U2&incident_id=314001710050302156-195663432827669173&edet=15&cinfo=0b000000"
            frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula
        incident ID: 314001710050302156-195663432827669173
    </iframe>
    </body>
</html>
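
A quick way to tell whether you received this block page (my sketch, not part of the original answer) is to look for the _Incapsula_Resource marker that appears in the HTML above:

import requests

# Assumption: the block page always embeds the _Incapsula_Resource
# script shown above, so its presence signals a blocked request.
def looks_blocked(html_text):
    return '_Incapsula_Resource' in html_text

data = requests.get('https://www.kicksusa.com/')
print(looks_blocked(data.text))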

Requests and BeautifulSoup

If you want to use requests and bs, open your browser's developer tools, copy the visid_incap_ and incap_ses_ cookies from the request headers sent to www.kicksusa.com, and use them in your request:

import requests
from bs4 import BeautifulSoup

headers = {
    'Host': 'www.kicksusa.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.121 Safari/537.36',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
    'Cookie': 'visid_incap_...=put here your visid_incap_ value; incap_ses_...=put here your incap_ses_ value',
}

response = requests.get('https://www.kicksusa.com/', headers=headers)

page = BeautifulSoup(response.content, "html.parser")

shop_buttons = page.select("span.shop-btn")
for button in shop_buttons:
    print(button.text)

print("the end")

When you run Selenium, you sometimes get the same bot-detection response (screenshot of the Incapsula block page omitted).

Reloading the page worked for me. Try the code below:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.kicksusa.com/')

if len(driver.find_elements_by_css_selector("[name=ROBOTS]")) > 0:
    driver.get('https://www.kicksusa.com/')

shop_buttons = driver.find_elements_by_css_selector("span.shop-btn")
for button in shop_buttons:
    print(button.text)
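
If the single reload still races the bot check, an explicit wait is a more robust variant (my sketch using Selenium's standard WebDriverWait; the 10-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.kicksusa.com/')

# Retry once if the bot-detection page was served.
if len(driver.find_elements_by_css_selector('[name=ROBOTS]')) > 0:
    driver.get('https://www.kicksusa.com/')

# Wait (up to 10 s) for the real shop buttons to appear.
buttons = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'span.shop-btn'))
)
for button in buttons:
    print(button.text)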

Answer 1 (score: 1)

Since the links repeat, you can use the following CSS selector to limit matches to the first of each pair:

#products-grid .item [href]:first-child

.find_elements_by_css_selector("#products-grid .item [href]:first-child")
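
For context, a short usage sketch (mine; it assumes you already have a driver on a page with the #products-grid markup):

links = driver.find_elements_by_css_selector('#products-grid .item [href]:first-child')
for link in links:
    print(link.get_attribute('href'))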