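I am scraping with the requests-html package under Python 3.6. I have tried it on several similar sites, but only poetryfoundation.org (https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20) returns the wrong page. I will demonstrate in detail.
Here is the source code. It only imports requests-html and returns the poems wrapped in div.c-feature elements: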
from requests_html import HTMLSession


class Scrapy:
    def __init__(self, session):
        self.session = session

    def request_content(self, url):
        # Fetch the page and select the poem containers.
        page = self.session.get(url)
        results = page.html.find('div.c-feature')
        return results


if __name__ == '__main__':
    session = HTMLSession()
    scrapy = Scrapy(session)
    url = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20'
    scrapy.request_content(url=url)
No matter how I change the parameters in the URL, it only returns the wrong page.
Thanks for your time.
Answer 0: (score: 0)
The page you get with requests differs from the one you get with selenium, because the site uses JavaScript to load and process its data.
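This also suggests why changing the parameters has no effect on a plain HTTP request: in this URL they come after `#`, which makes them part of the URL fragment. A fragment is handled on the client side and is not sent to the server, so only JavaScript running in a browser can act on it. A minimal standard-library sketch (not part of the original answer) that splits the URL from the question:

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20'

parts = urlsplit(url)

# Only the path (and the query string, empty here) is part of the HTTP request.
print(repr(parts.path))
print(repr(parts.query))

# Everything after '#' stays in the client as the fragment.
print(repr(parts.fragment))

# The fragment happens to use query-string syntax, so parse_qs can read it.
params = parse_qs(parts.fragment)
print(params)
```

The script that follows fetches the same URL both ways and compares the results.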
from selenium import webdriver
import requests

url = 'https://www.poetryfoundation.org/poems/browse#page=1&sort_by=recently_added&topics=20'

if __name__ == '__main__':
    # A: raw HTML fetched with requests (no JavaScript is executed).
    with requests.Session() as ses:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
            "Accept": "*/*",
            "Referer": "https://www.poetryfoundation.org/poems/browse",
            "Accept-Encoding": "gzip, deflate, br",
        }
        req = ses.get(url, headers=headers)
        A = req.text

    # B: page source after a browser has run the page's JavaScript.
    # Note: PhantomJS is deprecated and recent Selenium releases no longer
    # support it; a headless Chrome or Firefox driver can be substituted.
    dr = webdriver.PhantomJS()
    dr.get(url)
    B = dr.page_source
    dr.close()

    print(type(A) == type(B))
    print(A == B)
    print(len(A), len(B))
Output:
True # type(A) == type(B)
False # A == B
365477 482831
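The length difference alone does not show whether the elements the question looks for are present. One way to check is to count the div.c-feature elements in each string. The sketch below uses only the standard library; `ClassCounter`, `count_elements`, and the two sample strings are hypothetical stand-ins for illustration, not output captured from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or hold several class names.
        classes = (dict(attrs).get('class') or '').split()
        if self.css_class in classes:
            self.count += 1


def count_elements(html, tag='div', css_class='c-feature'):
    parser = ClassCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up samples: a page before and after JavaScript has filled it in.
raw_html = '<div id="app"></div>'
rendered_html = (
    '<div id="app">'
    '<div class="c-feature">poem 1</div>'
    '<div class="c-feature c-feature_x">poem 2</div>'
    '</div>'
)

print(count_elements(raw_html), count_elements(rendered_html))
```

Calling `count_elements(A)` and `count_elements(B)` in the script above would show whether the poems only appear in the browser-rendered source.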