requests_html render returns Access Denied

Time: 2020-08-16 19:14:20

Tags: python python-3.x

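When I try to render a page with requests_html, the server denies access. When I send the same request with plain requests, I get the HTML.

Why is my access being denied?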

Code

from requests_html import HTMLSession
s = HTMLSession()

base_url = 'https://secure.louisvuitton.com/eng-gb/checkout/review'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:79.0) Gecko/20100101 Firefox/79.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'TE': 'Trailers',
}

r = s.get(base_url, headers=headers)
print(r)


r.html.render()
print(r.html.text)

Terminal

<Response [200]>
Access Denied
Access Denied
You don't have permission to access "http://secure.louisvuitton.com/eng-gb/checkout/review" on this server.
Reference #18.6fce7a5c.1597604631.1e8bfd7

1 answer:

Answer 0 (score: 1)

The site does not seem to like headless browsers, which it detects from the User-Agent header. In my case the header was:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36

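The requests_html module uses Pyppeteer to render JavaScript. Pyppeteer lets you set the UA on a page, but I don't see a convenient way to override a class in requests_html to make that change: the page is created inside the _async_render function (a coroutine, to be precise).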
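To illustrate the detection described above: the headless UA differs from a regular Chrome UA only in the HeadlessChrome product token. Here is a minimal sketch, assuming the site keys on that token alone (this has not been verified against the site; the helper names is_headless_ua and strip_headless are hypothetical and not part of requests_html or Pyppeteer):

```python
import re

# The UA reported by the headless browser, as quoted in this answer.
HEADLESS_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/60.0.3112.113 Safari/537.36"
)


def is_headless_ua(user_agent: str) -> bool:
    """Return True if the UA string carries the HeadlessChrome product token."""
    return "HeadlessChrome/" in user_agent


def strip_headless(user_agent: str) -> str:
    """Rewrite 'HeadlessChrome/<version>' to 'Chrome/<version>'."""
    return re.sub(r"HeadlessChrome/", "Chrome/", user_agent)


if __name__ == "__main__":
    regular_ua = strip_headless(HEADLESS_UA)
    print(is_headless_ua(HEADLESS_UA))  # the original string is flagged
    print(is_headless_ua(regular_ua))   # the rewritten string is not
    print(regular_ua)
```

The rewritten string is the same value as the UA constant used in the Pyppeteer code in this answer.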

You can try using Pyppeteer directly and then use requests_html only to parse the HTML:

import asyncio
import traceback

from pyppeteer import launch
from requests_html import HTML

URL = 'https://secure.louisvuitton.com/eng-gb/checkout/review'
UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'


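async def fetch(url, browser):
    page = await browser.newPage()
    # Override the default HeadlessChrome UA before navigating.
    await page.setUserAgent(UA)

    try:
        await page.goto(url, {'waitUntil': 'load'})
    except Exception:
        # Log navigation errors; fetch() then returns None.
        traceback.print_exc()
    else:
        return await page.content()
    finally:
        await page.close()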


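async def main():
    browser = await launch(headless=True, args=['--no-sandbox'])

    try:
        doc = await fetch(URL, browser)
    finally:
        # Close the browser even if fetch() raises.
        await browser.close()

    if doc is None:
        # Navigation failed; the traceback was already printed.
        return

    html = HTML(html=doc)
    print(html.links)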


if __name__ == '__main__':
    asyncio.run(main())
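
Because the blocked request in the question still came back as <Response [200]>, the status code alone does not show whether the UA override worked. Here is a minimal sketch of a check on the returned text, assuming the block page always contains the wording seen in the question's terminal output (the function name looks_blocked is hypothetical, and the check is only a heuristic):

```python
# Sample block-page text, copied from the terminal output in the question.
SAMPLE_BLOCK_PAGE = (
    "Access Denied\n"
    "Access Denied\n"
    "You don't have permission to access "
    '"http://secure.louisvuitton.com/eng-gb/checkout/review" on this server.\n'
    "Reference #18.6fce7a5c.1597604631.1e8bfd7"
)


def looks_blocked(page_text: str) -> bool:
    """Heuristic: True if the text resembles the 'Access Denied' block page."""
    lowered = page_text.lower()
    return "access denied" in lowered and "permission to access" in lowered


if __name__ == "__main__":
    print(looks_blocked(SAMPLE_BLOCK_PAGE))
    print(looks_blocked("<html><body>Checkout review</body></html>"))
```

Calling looks_blocked on the string returned by page.content() before parsing would make a still-blocked response obvious instead of printing an empty set of links.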