I've been trying to fetch the links connected to the different exhibitors from this webpage using a python script, but I get nothing as a result and no error either. The class name m-exhibitors-list__items__item__name__link I used in my script
is available in the page source, so it is not dynamically generated.
What should I change in my script to get the links?
This is what I've tried:
from bs4 import BeautifulSoup
import requests

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
I need links like this one (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
Answer 0 (score: 2)
@Life is complex: that is because the website you are scraping is protected by the Incapsula service, which shields sites from web scraping and other attacks. It checks the request headers to decide whether a request comes from a browser or from a bot. Most likely the website holds proprietary data, or it may simply be blocking other threats.
However, there is an option to achieve this using Selenium together with BS4. Below is a code snippet for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)

wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")

results = soup.find_all("a", {"class": "m-exhibitors-list__items__item__name__link"})

# iterate the list of anchor tags to get the href attribute
for item in results:
    print(item.get("href"))

wd.quit()
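As a side note, newer Selenium 4 releases pass the driver path through a Service object, and an explicit wait helps make sure the Incapsula check has finished and the exhibitor links are actually present in the DOM before you read them. A minimal sketch of the same idea under those assumptions, reusing the placeholder chromedriver path from above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"  # placeholder path, adjust to your machine

wd = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))
try:
    wd.get(link)
    # wait until at least one exhibitor link is present in the DOM
    WebDriverWait(wd, 20).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "a.m-exhibitors-list__items__item__name__link")
        )
    )
    # read the href attribute straight from Selenium; no BeautifulSoup needed for this part
    for anchor in wd.find_elements(By.CSS_SELECTOR,
                                   "a.m-exhibitors-list__items__item__name__link"):
        print(anchor.get_attribute("href"))
finally:
    wd.quit()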
Answer 1 (score: 1)
The website you are trying to scrape is protected by Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

http_headers = {'User-Agent': 'Mozilla/5.0'}  # assumed browser-like headers; not shown in the original snippet

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text

soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)
**OUTPUTS**
('Request unsuccessful. Incapsula incident ID: '
'438002260604590346-1456586369751453219')
Read through the following: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula