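I want to scrape some data from a website using Python requests together with the Splash browser (https://splash.readthedocs.io/en/stable/) and custom headers. Before starting to scrape, however, I decided to check the headers I am sending by requesting http://xhaus.com/headers. As a result, I found that I am not sending the headers I intended to send.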
import random

import requests


def random_user_agent():
    # Pick a random User-Agent string from user-agents.txt (one per line).
    with open('user-agents.txt', 'r') as f:
        user_agents = [line.rstrip('\n') for line in f]
    return random.choice(user_agents)


def headers():
    # Start from the requests defaults and override only the User-Agent.
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': random_user_agent()
    })
    return headers


splash = 'http://localhost:8050/render.html'
headers = headers()
url_h = 'http://xhaus.com/headers'
# The headers are attached to the request sent to the local Splash instance.
page = requests.get(splash, params={'url': url_h}, headers=headers)
After running this code, my headers contain the following User-Agent:
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
However, when I check it through the website I mentioned, it shows me a different User-Agent:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
...
<td class="even">
User-Agent
</td>
<td class="even">
<b>
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/538.1 (KHTML, like Gecko) splash Safari/538.1
</b>
</td>
...
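One direction that might explain the difference, sketched under assumptions: in the code above the custom headers are attached to the request that goes to Splash itself, and Splash then makes its own request to the target page with its own User-Agent. The Splash HTTP API documents a `headers` argument for render.html that is only supported for application/json POST requests, so one option could be to send the headers inside a JSON body instead of as headers of the request to Splash. The helper names `build_splash_payload` and `fetch_via_splash` are hypothetical and only for illustration; the endpoint is the local Splash instance used above.

```python
import requests

# Same local Splash endpoint as in the question (assumed to be running).
SPLASH_RENDER_URL = 'http://localhost:8050/render.html'


def build_splash_payload(target_url, custom_headers):
    """Build the JSON body for a render.html POST request.

    `url` is the page Splash should render; `headers` holds the headers
    Splash should use for its outgoing request to that page.
    """
    return {
        'url': target_url,
        'headers': dict(custom_headers),
    }


def fetch_via_splash(target_url, custom_headers, timeout=30):
    """POST the payload as JSON so Splash can read the `headers` argument."""
    payload = build_splash_payload(target_url, custom_headers)
    # json=... makes requests send the body as application/json.
    response = requests.post(SPLASH_RENDER_URL, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.text
```

With Splash running, calling `fetch_via_splash('http://xhaus.com/headers', {'User-Agent': ua})` with a User-Agent string `ua` of your choice and parsing the result as before should show whether the custom value reaches the target site.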