Question

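I have been trying to use a Python script to get the links to the different exhibitors from this webpage, but I get no results and no errors. The class name m-exhibitors-list__items__item__name__link that I use in the script is present in the page source, so the links are not generated dynamically.

What should I change in the script to get the links?

Here is what I have tried: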

from bs4 import BeautifulSoup
import requests

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

with requests.Session() as s: 
    s.headers['User-Agent']='Mozilla/5.0'  
    response = s.get(link)
    soup = BeautifulSoup(response.text,"lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))

I expect links like this one (the first):

https://www.topdrawer.co.uk/exhibitors/alessi-1

Answer 1

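@Life is complex: the website you are trying to scrape is protected by the Incapsula service, which guards sites against web scraping and other attacks. It inspects the request headers to decide whether a request comes from a browser or from a bot (that is, whether you are a human or a robot). Sites that use it most likely hold proprietary data, or they may be using it to block other threats.

However, one option is to use Selenium together with BS4. Here is a code snippet for reference: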

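from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

# Raw string so the backslashes in the Windows path are not read as escapes
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads/Chromedriver.exe"

# Selenium 4 takes the driver path through a Service object
wd = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))

try:
    # get() returns None; the rendered HTML is read from page_source
    wd.get(link)
    soup = BeautifulSoup(wd.page_source, "lxml")
    results = soup.find_all("a", {"class": "m-exhibitors-list__items__item__name__link"})

    # Iterate over the anchor tags and print each href attribute
    for item in results:
        print(item.get("href"))
finally:
    # Close the browser even if parsing fails
    wd.quit()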
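
The question asks for full links such as https://www.topdrawer.co.uk/exhibitors/alessi-1, while an href attribute may be relative. A minimal sketch of normalizing the values follows, assuming the hrefs could be relative, root-relative, or already absolute. The names BASE_URL and absolutize, and the sample hrefs, are hypothetical and not part of the original answer.

```python
from urllib.parse import urljoin

# Assumption: the site root used to resolve relative hrefs
BASE_URL = "https://www.topdrawer.co.uk/"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the base URL, skipping missing values."""
    return [urljoin(base, href) for href in hrefs if href]


# Hypothetical href values of the kinds item.get("href") might return
sample_hrefs = [
    "exhibitors/alessi-1",
    "/exhibitors/alessi-1",
    "https://www.topdrawer.co.uk/exhibitors/alessi-1",
    None,
]

full_links = absolutize(sample_hrefs)
for url in full_links:
    print(url)
```

In the snippet above this could be applied to the list of item.get("href") values before printing.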

Answer 2

The website you are trying to scrape is protected by Incapsula.

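from pprint import pprint

import requests
from bs4 import BeautifulSoup

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'

# Assumption: http_headers was not defined in the original answer;
# a plain browser-like User-Agent is used here for illustration
http_headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(target_url, headers=http_headers,
                        allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')

pprint(soupParser.text)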

**OUTPUTS**
('Request unsuccessful. Incapsula incident ID: '
'438002260604590346-1456586369751453219')
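
This block response also explains why the script in the question prints nothing without raising an error: the selector simply finds no matching anchors in the block page. A minimal sketch of checking for that case before parsing follows, assuming the block page always contains the wording shown in the output above. The function name looks_like_incapsula_block and the sample strings are hypothetical.

```python
# Marker strings taken from the block page output shown above.
# Assumption: other block pages use the same wording.
BLOCK_MARKERS = ("Incapsula incident ID", "Request unsuccessful")


def looks_like_incapsula_block(html: str) -> bool:
    """Return True if the response body looks like an Incapsula block page."""
    return any(marker in html for marker in BLOCK_MARKERS)


blocked_body = (
    "Request unsuccessful. Incapsula incident ID: "
    "438002260604590346-1456586369751453219"
)
normal_body = (
    '<a class="m-exhibitors-list__items__item__name__link" '
    'href="exhibitors/alessi-1">Alessi</a>'
)

print(looks_like_incapsula_block(blocked_body))
print(looks_like_incapsula_block(normal_body))
```

Calling this on response.text before building the soup would turn a silent empty result into an explicit signal that the request was blocked.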

Read through the following: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula

And these: https://stackoverflow.com/search?q=Incapsula
