Question

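I am trying to scrape some data from a website using BeautifulSoup on Python 3.5 (I am working in Eclipse). The page I am requesting, "http://www.transfermarkt.com/arsenal-fc/startseite/verein/11/saison_id/2015", has statistics for some football players.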

My code:

from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.transfermarkt.com/arsenalfc/startseite/verein/11/saison_id/2015')
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

I expected neatly formatted HTML, but the output I get is:

<html>
 <head>
  <title>
   404 Not Found
  </title>
 </head>
 <body bgcolor="white">
  <center>
   <h1>
    404 Not Found
   </h1>
  </center>
  <hr>
   <center>
    nginx
   </center>
  </hr>
 </body>
</html>

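It works fine for other URLs; I tried several others and they all worked, just not this one. Am I doing something wrong? Any suggestions are appreciated. Thanks.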

Answer 1

You should send a User-Agent header so the website thinks the request comes from a browser. This worked for me:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
r = requests.get('http://www.transfermarkt.com/arsenalfc/startseite/verein/11/saison_id/2015', headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())
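
If it helps, here is a minimal sketch of the same idea wrapped in a function that also checks the HTTP status before parsing, so an error page like the 404 in the question raises an exception instead of being pretty-printed. The names fetch_soup and BROWSER_HEADERS and the timeout value are my own assumptions for illustration; the User-Agent string is the one from the snippet above, and whether a given site accepts it can change over time.

```python
import requests
from bs4 import BeautifulSoup

# Assumption: any realistic browser User-Agent should do; this one is
# copied from the answer above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
    )
}


def fetch_soup(url, session=None, timeout=10):
    """Fetch url with a browser-like User-Agent and return the parsed HTML.

    Raises requests.HTTPError on a 4xx/5xx response rather than parsing
    the server's error page. fetch_soup is a hypothetical helper name.
    """
    if session is None:
        session = requests.Session()
    response = session.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    # Turn "404 Not Found" and similar responses into an exception.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

You would call it as fetch_soup('http://www.transfermarkt.com/arsenalfc/startseite/verein/11/saison_id/2015') and catch requests.HTTPError around the call if you want to handle a blocked or missing page yourself.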

Scraping with Beautiful Soup does not work as expected for a specific URL
