requests-html returns the wrong page for the correct URL

Time: 2018-10-01 19:08:17

Tags: url web-scraping scrapy python-requests

I am scraping with the requests-html package under Python 3.6. I have tried it on several related sites, but only poetryfoundation.org (https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20) returns the wrong page. Details follow.

Here is the source code. It only imports requests-html and looks up the poems, which are wrapped in div.c-feature elements:
from requests_html import HTMLSession

class Scrapy:
    def __init__(self, session):
        self.session = session

    def request_content(self, url):
        # Fetch the page. No JavaScript has run at this point.
        page = self.session.get(url)
        # Look up the poem containers and return them to the caller.
        return page.html.find('div.c-feature')


if __name__ == '__main__':
    session = HTMLSession()
    scrapy = Scrapy(session)

    url = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20'
    results = scrapy.request_content(url=url)
    print(len(results))

No matter how I change the parameters in the URL, it only returns the wrong page.

Thanks for your time.

1 Answer:

Answer 0: (score: 0)

The page you get with requests differs from the one you get with selenium, because the site uses JavaScript to process the data.
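One detail that likely explains why changing the parameters has no effect: everything after the # in the URL is a fragment, and HTTP clients do not send the fragment to the server. The sketch below uses only the standard library to show what the server can and cannot see. The helper name server_visible_url is hypothetical, and the assumption that the site's JavaScript reads these fragment parameters is an inference, not something confirmed by the site.

```python
from urllib.parse import urlsplit, parse_qs

URL = ("https://www.poetryfoundation.org/poems/browse"
       "#page=1&sort_by=recently_added&topics=20")


def server_visible_url(url):
    """Return the URL without its fragment (hypothetical helper)."""
    return urlsplit(url)._replace(fragment="").geturl()


parts = urlsplit(URL)

# The query string is empty: no parameters appear before the '#'.
print(repr(parts.query))

# The fragment holds all the "parameters"; it stays on the client side.
print(parts.fragment)

# The fragment is formatted like a query string, so parse_qs can read it.
print(parse_qs(parts.fragment))

# This is the only part of the URL that reaches the server.
print(server_visible_url(URL))
```

Under this reading, editing page, sort_by, or topics changes nothing in the HTTP request itself, so a plain GET keeps returning the same base page.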

from selenium import webdriver
import requests

url = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20'

if __name__ == '__main__':
    # A: the HTML as served, fetched with requests (no JavaScript is executed).
    with requests.Session() as ses:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
            "Accept": "*/*",
            "Referer": "https://www.poetryfoundation.org/poems/browse",
            "Accept-Encoding": "gzip, deflate, br",
        }
        req = ses.get(url, headers=headers)
        A = req.text

    # B: the page source after a browser has loaded the page and run its JavaScript.
    # Note: PhantomJS support was removed in Selenium 4; a headless browser driver is needed there.
    dr = webdriver.PhantomJS()
    dr.get(url)
    B = dr.page_source
    dr.close()

    print(type(A) == type(B))
    print(A == B)
    print(len(A), len(B))

Output

True # type(A) == type(B)
False # A == B
365477 482831 # len(A), len(B)
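Comparing lengths shows that the two sources differ, but not whether the poems are actually present in either one. A minimal sketch of a more direct check, assuming the poems sit in div elements with the class c-feature (the selector from the question): count those elements in each page source. The names ClassCounter and count_elements are hypothetical, and the two HTML strings are made-up stand-ins for A and B, not real output from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be absent or valueless (None).
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_elements(page_source, tag="div", css_class="c-feature"):
    """Return how many matching elements appear in the page source."""
    counter = ClassCounter(tag, css_class)
    counter.feed(page_source)
    counter.close()
    return counter.count


# Made-up stand-ins for A (requests) and B (browser-rendered).
static_html = '<html><body><div id="app"></div></body></html>'
rendered_html = (
    '<html><body><div id="app">'
    '<div class="c-feature">Poem one</div>'
    '<div class="c-feature c-feature_alt">Poem two</div>'
    '</div></body></html>'
)

print(count_elements(static_html))
print(count_elements(rendered_html))
```

Calling count_elements(A) and count_elements(B) on the real page sources would show whether the containers exist only after JavaScript has run, which is what the answer's explanation predicts.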