I've completed a couple of small projects successfully, but I've been struggling to get a request working against this site - any tips?
Update - I'm hoping for a complete BeautifulSoup request so that I can scrape the information from the table.
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2")
soup = BeautifulSoup(r.content,"html.parser")
print(soup)
This returns:
<html>
<head><title>404 Not Found</title></head>
<body bgcolor="white">
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</hr></body>
</html>
Answer 0 (score: 1)
You need to pretend to be a real user browsing with a web browser by supplying a User-Agent header:
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers={
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
})
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>>
>>> r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers={
... "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
... })
>>> soup = BeautifulSoup(r.content,"html.parser")
>>> print(soup.title.get_text())
Top market values 15/16 - Championship - Transfermarkt
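Since the update asks about grabbing the table itself, here is a minimal sketch that continues from the request above. The table class name "items" is an assumption about Transfermarkt's markup, so inspect the live page source and adjust the selector if it differs:
from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"
}
r = requests.get("http://www.transfermarkt.co.uk/championship/marktwerte/wettbewerb/GB2", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")

# "items" is an assumed class name for the main table - check the page source and adjust if needed.
table = soup.find("table", class_="items")
if table is not None:
    for row in table.find_all("tr"):
        # Collect the text of every header/data cell in the row.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            print(cells)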
Answer 1 (score: 0)
Some sites will not respond to a plain request because many of them check whether the request comes from a browser or a bot.
So, let's make the request look like it comes from a browser.
This can be done by setting the headers as follows:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
Then simply add these headers to your GET request like so:
response = requests.get("https://example.com", headers=headers)
Putting it all together, you get:
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36",
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
}
response = requests.get("https://example.com", headers=headers)
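From there you would feed the response into BeautifulSoup exactly as in the first answer. A minimal continuation of the snippet above (the status-code check is just a defensive assumption, and example.com is only a placeholder URL):
from bs4 import BeautifulSoup

# Only parse the page if the request actually succeeded.
if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.title.get_text())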