Unable to fetch the links to the different exhibitors from a webpage

Asked: 2019-01-04 17:21:24

Tags: python python-3.x web-scraping beautifulsoup

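I've been trying to fetch the links to the different exhibitors from this webpage using a Python script, but I get no results and no error either. The class name m-exhibitors-list__items__item__name__link that I use in the script is present in the page source, so the links are not dynamically generated.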

  

What should I change in my script to get the links?

This is what I've tried:

from bs4 import BeautifulSoup
import requests

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

with requests.Session() as s: 
    s.headers['User-Agent']='Mozilla/5.0'  
    response = s.get(link)
    soup = BeautifulSoup(response.text,"lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))

I need links like this one (the first one):

https://www.topdrawer.co.uk/exhibitors/alessi-1

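2 answers:

Answer 0: (score: 2)

As @Life is complex points out, the website you are trying to scrape is protected by the Incapsula service, which shields sites from web scraping and other attacks. It inspects the request headers to decide whether a request comes from a browser or from a bot. Most likely the site holds proprietary data, or it is guarding against other threats.

However, one option is to use Selenium together with BS4. Here is a code snippet for reference: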

from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

# Raw string, so that "\U" in the Windows path is not parsed as a Unicode escape
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads/Chromedriver.exe"

# Selenium 3 style; Selenium 4+ expects webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))
wd = webdriver.Chrome(CHROMEDRIVER_PATH)

# Load the page in a real browser (wd.get returns None, so nothing is assigned)
wd.get(link)

html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.find_all("a", {"class": "m-exhibitors-list__items__item__name__link"})

# Iterate over the anchor tags and print each href attribute
for item in results:
    print(item.get("href"))
wd.quit()
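
The href values printed by the snippet above may be relative paths rather than full URLs; the page markup isn't shown here, so that is an assumption. If they are, a minimal sketch for turning them into links like the one the question asks for could use urljoin from the standard library. The helper name to_absolute and the sample hrefs are hypothetical, for illustration only.

```python
from urllib.parse import urljoin

# Page the hrefs were scraped from (taken from the question)
BASE_URL = "https://www.topdrawer.co.uk/exhibitors?page=1"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the page URL, skipping missing ones."""
    return [urljoin(base, href) for href in hrefs if href]


# Hypothetical sample values: one relative, one already absolute, one missing
sample_hrefs = [
    "/exhibitors/alessi-1",
    "https://www.topdrawer.co.uk/exhibitors/alessi-1",
    None,
]
for url in to_absolute(sample_hrefs):
    print(url)
```

Absolute hrefs pass through urljoin unchanged, so the helper can be applied whether or not the site uses relative links.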

Answer 1: (score: 1)

The website you are trying to scrape is protected by Incapsula.

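from pprint import pprint

import requests
from bs4 import BeautifulSoup

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'

# http_headers was not shown in the original snippet; the question's User-Agent is assumed
http_headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(target_url, headers=http_headers,
                        allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')

pprint(soupParser.text)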

Output:
('Request unsuccessful. Incapsula incident ID: '
'438002260604590346-1456586369751453219')
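
Since the block page comes back as an ordinary response with no exception, the script in the question fails silently. A minimal sketch of a check that makes the failure visible, assuming the block page always contains the marker text seen in the output above; the helper name is_incapsula_block and the sample strings are hypothetical.

```python
# Marker text taken from the block page shown in the output above
BLOCK_MARKER = "Incapsula incident ID"


def is_incapsula_block(body: str) -> bool:
    """Return True if the response body looks like an Incapsula block page."""
    return BLOCK_MARKER.lower() in body.lower()


# Hypothetical sample bodies, for illustration only
blocked = ("Request unsuccessful. Incapsula incident ID: "
           "438002260604590346-1456586369751453219")
normal = ('<a class="m-exhibitors-list__items__item__name__link" '
          'href="/exhibitors/alessi-1">Alessi</a>')

print(is_incapsula_block(blocked))
print(is_incapsula_block(normal))
```

Calling such a check on response.text before parsing would explain an empty result from soup.select right away.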

Read through the following: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula

And these as well: https://stackoverflow.com/search?q=Incapsula