I'm trying to scrape data from kicksusa.com, but I'm running into some problems.
When I try a basic BS4 approach, like this (the imports are copy/pasted from the main program that uses all of them):
import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup
data1 = requests.get('https://www.kicksusa.com/')
soup1 = BeautifulSoup(data1.text, 'html.parser')
button = soup1.find('span', attrs={'class': 'shop-btn'}).text.strip()
print(button)
The result is `None`, which tells me the data is hidden behind JS. So I tried Selenium, like this:
options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get('https://www.kicksusa.com/')
url = driver.find_element_by_xpath("//span[@class='shop-btn']").text
print(url)
driver.close()
I get "Unable to locate element".
Does anyone know how to scrape this site with BS4 or Selenium? Thanks in advance!
Answer 0 (score: 1)
The problem is that you are being detected as a bot and get a response like this:
<html style="height:100%">
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta name="format-detection" content="telephone=no">
<meta name="viewport" content="initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<script type="text/javascript" src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3"></script>
</head>
<body style="margin:0px;height:100%">
<iframe src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=5-36224256-0%200NNN%20RT%281552245394179%20277%29%20q%280%20-1%20-1%200%29%20r%280%20-1%29%20B15%2811%2c110765%2c0%29%20U2&incident_id=314001710050302156-195663432827669173&edet=15&cinfo=0b000000"
frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula
incident ID: 314001710050302156-195663432827669173
</iframe>
</body>
</html>
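Before parsing, it can help to check programmatically whether you received this block page instead of the real site. This is a minimal sketch; the marker strings are assumptions based on the response shown above:

```python
def looks_like_incapsula_block(html: str) -> bool:
    """Heuristic: does this HTML look like the Incapsula challenge page above?"""
    markers = ("_Incapsula_Resource", "Incapsula incident ID")
    return any(marker in html for marker in markers)

# The block page above embeds an /_Incapsula_Resource script and iframe
block_page = '<iframe src="/_Incapsula_Resource?CWUDNSAI=20">Request unsuccessful.</iframe>'
print(looks_like_incapsula_block(block_page))                  # True
print(looks_like_incapsula_block("<html><body>shop</body></html>"))  # False
```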
If you want to use `requests` and `bs4`, copy the `visid_incap_` and `incap_ses_` cookies for www.kicksusa.com from the request headers in your browser's developer tools, and send them with your request:
import requests
from bs4 import BeautifulSoup
headers = {
    'Host': 'www.kicksusa.com',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.121 Safari/537.36',
    'DNT': '1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
    'Cookie': 'visid_incap_...=put here your visid_incap_ value; incap_ses_...=put here your incap_ses_ value',
}
response = requests.get('https://www.kicksusa.com/', headers=headers)
page = BeautifulSoup(response.content, "html.parser")
shop_buttons = page.select("span.shop-btn")
for button in shop_buttons:
    print(button.text)
print("the end")
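Instead of pasting the raw string into the `Cookie` header, you can also split it into a dict and pass it via the `cookies=` parameter of `requests.get`. A small sketch (the helper name and the sample cookie values are hypothetical):

```python
def cookie_header_to_dict(raw: str) -> dict:
    """Split a raw Cookie header copied from dev tools into a name -> value dict."""
    return dict(part.strip().split("=", 1) for part in raw.split(";") if "=" in part)

# Hypothetical values; use the real visid_incap_/incap_ses_ cookies from your browser
raw = "visid_incap_123=abc; incap_ses_456=def"
cookies = cookie_header_to_dict(raw)
print(cookies)  # {'visid_incap_123': 'abc', 'incap_ses_456': 'def'}

# Then: requests.get('https://www.kicksusa.com/', headers=headers, cookies=cookies)
```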
Reloading the page worked for me. Try the following code:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.kicksusa.com/')
if len(driver.find_elements_by_css_selector("[name=ROBOTS]")) > 0:
    driver.get('https://www.kicksusa.com/')
shop_buttons = driver.find_elements_by_css_selector("span.shop-btn")
for button in shop_buttons:
    print(button.text)
Answer 1 (score: 1)
For the links, which would otherwise come back duplicated, you can restrict the match to the first of each pair with the following CSS selector:
#products-grid .item [href]:first-child
i.e.
.find_elements_by_css_selector("#products-grid .item [href]:first-child")
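The same selector also works with BeautifulSoup's `select`, so you can apply it in the `requests`-based approach from the other answer. A sketch against made-up markup that mimics the `#products-grid` structure (the sample HTML is an assumption, not the site's actual markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each .item holds two links to the same product
sample = """
<div id="products-grid">
  <div class="item"><a href="/shoe-1">Shoe 1</a><a href="/shoe-1">Shoe 1 dup</a></div>
  <div class="item"><a href="/shoe-2">Shoe 2</a><a href="/shoe-2">Shoe 2 dup</a></div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")
# [href]:first-child keeps only the first link inside each .item
links = [a["href"] for a in soup.select("#products-grid .item [href]:first-child")]
print(links)  # each product link appears once
```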