Beautiful Soup does not pull all of the HTML from a webpage

Time: 2018-01-21 10:42:00

Tags: python html beautifulsoup

I am practicing with BeautifulSoup. I want to extract the image address of the football player's picture from this page: https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652

When I inspect the page, the part containing the img src looks like this:

    <div class="dataBild">
        <img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
        <div class="bildquelle"><span title="imago">imago</span></div>
    </div>

So I thought I could use BeautifulSoup to find the div with class = "dataBild", since it is unique.

# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)


#Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

player_img = soup.find_all('div', {'class':'dataBild'})
print(player_img)

This runs, but it does not output anything. So I checked by simply running print(soup):

# Import the Libraries that I need
import urllib3
import certifi
from bs4 import BeautifulSoup

# Specify the URL
url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url)


#Parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(response.data, "html.parser")

print(soup)

This outputs:

<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr/><center>nginx</center>
</body>
</html>

So clearly it is not pulling all of the HTML from the webpage. Why is that? And is my logic of looking for the div with class = "dataBild" sound?

1 answer:

Answer 0 (score: 2)

The website appears to check whether the request carries a valid User-Agent header.

So you need to add the header like this:

import urllib3
import certifi

url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
response = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.status)

This prints 200. If you remove the header, you get a 404.
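Since the 404 page parses without error, it can help to check the status before handing the body to BeautifulSoup. The following is only a minimal sketch: the helper name `fetch_html` and its optional `http` parameter are assumptions made for illustration and are not part of urllib3.

```python
import certifi
import urllib3


def fetch_html(url, http=None, user_agent="Mozilla/5.0"):
    """Fetch url with a User-Agent header and raise if the status is not 200.

    fetch_html is a hypothetical helper; http may be any object that offers
    a urllib3-style request() method (defaults to a verified PoolManager).
    """
    if http is None:
        http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())
    response = http.request("GET", url, headers={"User-Agent": user_agent})
    if response.status != 200:
        # Fail loudly instead of silently parsing an error page.
        raise RuntimeError(f"GET {url} returned HTTP {response.status}")
    return response.data


# Example call (performs a real network request):
# html = fetch_html("https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652")
```

With a check like this, the missing header would have surfaced as an exception rather than as an empty search result.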

Any non-empty User-Agent value (after trimming whitespace) seems to work.
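As for the second part of the question, looking up the div by its class should work once the real page is returned. Here is a minimal sketch that runs against the markup quoted in the question rather than a live request; the function name `extract_portrait_url` is an assumption for illustration, and the live page may of course differ from that snippet.

```python
from bs4 import BeautifulSoup


def extract_portrait_url(html):
    """Return the src of the first <img> inside <div class="dataBild">, or None.

    extract_portrait_url is a hypothetical helper name.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="dataBild")  # class names are case-sensitive
    if container is None:
        return None
    img = container.find("img")
    if img is None:
        return None
    return img.get("src")


# Markup copied from the question, used in place of a live response.
sample_html = """
<div class="dataBild">
<img src="https://tmssl.akamaized.net//images/portrait/header/195652-1456301478.jpg?lm=1456301501" title="Jordon Ibe" alt="Jordon Ibe" class="">
<div class="bildquelle"><span title="imago">imago</span></div>
</div>
"""

print(extract_portrait_url(sample_html))
```

Returning None when the div is missing makes the 404 case easy to spot, since the error page contains no such element.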