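When I try to render the page with requests_html, the server denies access. When I send the request with plain requests, I get the HTML.
Why is my access being denied?
Code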
from requests_html import HTMLSession
s = HTMLSession()
base_url = 'https://secure.louisvuitton.com/eng-gb/checkout/review'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'TE': 'Trailers',
}
r = s.get(base_url, headers=headers)
print(r)
r.html.render()
print(r.html.text)
Terminal
<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://secure.louisvuitton.com/eng-gb/checkout/review" on this server.
Reference #18.6fce7a5c.1597604631.1e8bfd7
Answer 0 (score: 1)
The site doesn't seem to like headless browsers, and it detects them from the User-Agent header. In my case, the header was:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36
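The only part of that string that gives the browser away is the HeadlessChrome product token. As a minimal sketch, assuming that token is what the site keys on, a normal-looking UA can be derived from the headless one. The helper name strip_headless is hypothetical and not part of requests_html or Pyppeteer.

```python
import re

# The headless User-Agent quoted above.
HEADLESS_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36"
)


def strip_headless(user_agent: str) -> str:
    """Hypothetical helper: rename the HeadlessChrome token to Chrome."""
    return re.sub(r"\bHeadlessChrome/", "Chrome/", user_agent)


print(strip_headless(HEADLESS_UA))
```

The version number is left untouched, so the result still matches the Chromium build that is actually running.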
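Now, the requests_html module uses Pyppeteer to render JavaScript. Pyppeteer has an option to set the UA for a page, but I don't see a convenient way to override a class in requests_html to make this change. The page object is created inside the _async_render function (a coroutine, to be precise).
You can try using Pyppeteer directly and then use requests_html only to parse the HTML: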
import asyncio
import traceback

from pyppeteer import launch
from requests_html import HTML

URL = 'https://secure.louisvuitton.com/eng-gb/checkout/review'
# Same as the headless UA above, with "HeadlessChrome" replaced by "Chrome".
UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'


async def fetch(url, browser):
    page = await browser.newPage()
    # Override the User-Agent before navigating.
    await page.setUserAgent(UA)
    try:
        await page.goto(url, {'waitUntil': 'load'})
    except Exception:
        traceback.print_exc()
    else:
        return await page.content()
    finally:
        await page.close()


async def main():
    browser = await launch(headless=True, args=['--no-sandbox'])
    try:
        doc = await fetch(URL, browser)
    finally:
        await browser.close()
    if doc is None:  # navigation failed; the traceback was already printed
        return
    html = HTML(html=doc)
    print(html.links)


if __name__ == '__main__':
    asyncio.run(main())
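Note that in the question the block page came back with status 200, so checking the status code alone won't tell you whether the UA change worked. A rough sketch of a body check, assuming the block page always contains the two phrases quoted in the question's output; looks_blocked is a hypothetical helper, and it expects the extracted page text rather than raw HTML:

```python
def looks_blocked(page_text: str) -> bool:
    """Return True if the text resembles the block page quoted in the question."""
    text = page_text.lower()
    # Both markers are taken from the "Access Denied" output in the question.
    return "access denied" in text and "you don't have permission" in text
```

For example, you could call looks_blocked(html.text) before using html.links and retry or bail out if it returns True.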