Unable to get the entire HTML page using Python requests

Date: 2020-09-07 09:36:16

Tags: python html python-requests-html

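I am working on a card editor for the game "Cards Against Humanity". To get ideas, I would like to programmatically download entire card decks from the following web page. Using the browser's inspect tool, I found where the cards are stored: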

Location of card description

As can be seen, inside the whitecards and blackcards classes each card has its own ID, and the card's phrase or idea is written there.

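The general purpose of my code is to take a deck URL and scrape all of its cards (white and black). My first approach was to use the Requests package in Python. I used the following code: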

import requests
from bs4 import BeautifulSoup

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

root = soup.find(id='root')

However, when I inspect the root object, it is empty, even though it should contain all of the whitecards and blackcards elements.

1 Answer:

Answer 0: (score: 0)

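Often a web page is not complete when it is first loaded. Typically, after the initial load, JavaScript code issues one or more AJAX requests whose results modify the DOM, which is why fetching the page with requests does not yield the final, complete DOM. I loaded the page in a browser and looked at the XHR network requests made after the page load, but none of them seemed to return the missing information, which was a bit puzzling. My solution was therefore to use Selenium to drive a browser (Chrome in the example below) and scrape the page. After the initial page load, it is necessary to wait a second or two to make sure the DOM is complete: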

from selenium import webdriver
from bs4 import BeautifulSoup
import time

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view'
options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get(URL)
time.sleep(1) # wait a second for <div id="root"> to be fully loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
root = soup.find(id='root')
print(root)
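A fixed time.sleep(1) can be too short on a slow connection and wasted time on a fast one. As a minimal sketch of the alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until and its parameters are made up for this illustration; Selenium also ships its own WebDriverWait class for the same purpose, which is likely the better choice in real code.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value.

    Returns that truthy value, or raises TimeoutError once `timeout`
    seconds have passed without success. (Hypothetical helper, not
    part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Assumed usage with the driver from the snippet above, replacing time.sleep(1):
# wait until <div id="root"> has at least one descendant element.
# wait_until(lambda: driver.find_elements("css selector", "#root *"))
```

The predicate is evaluated at least once, so a condition that is already true returns immediately without sleeping.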

Update

I took a closer look at the AJAX calls, and it appears that the following URL returns the actual data you are interested in:

https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get

import requests

URL = 'https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/get'
resp = requests.get(URL)
print(resp.json())

Prints:

{'success': True, 'expansion': {'_id': '5e758e4034489b003f4529f6', 'name': 'Global Pandemic Pack', 'author': '5dfde1f4897a0f003e2fb547', 'description': "Who says in-house quarantine has to suck? For the price of a handful of toilet paper rolls, you can gain some original pandemic-themed cards that'll surely spice up your card games. Get your hands on the first-ever official Cards Lacking Originality card pack now! I mean it, right now!", 'price': 0, 'published': True, 'featured': True, 'dateCreated': '2020-03-21T03:47:12.167Z', '__v': 0, 'gamesUsed': 655, 'whiteCards': ['$1,200 Trump bucks.', 'A free extra week on the cruise ship!', 'A long Zoom meeting with no obvious purpose.', 'A lukewarm bowl of bat soup.', 'A mass panic caused by a sneeze.', 'Babies concieved under quarantine.', 'Beautiful cross-cultural friendships.', 'Binging 30 straight seasons of "The Simpsons."', 'Burying your head in a screen to escape family time.', 'Costco: Battle Royale.', 'Craving any excuse to party.', 'Crying and then sleeping and then crying.', 'Eating all the quarantine food within a day.', 'Ejaculating into the air and trying to catch it in your mouth.', 'Exchanging blowjobs for Kleenex and toilet paper.', 'Forgetting what genuine human connection feels like.', 'Groupons at funeral homes.', 'Hating the media.', 'Insatiable horniness.', 'Kung Flu fighting.', "My Gram-Gram's loooooong vacation!", 'Online class shootings.', 'Only washing hands after the CDC says you have to.', 'Plague, Inc.', 'Praying for the sweet release of death.', 'Raging Ebola.', 'Rediscovering the wonders of video games.', 'Some Lyme disease to go with your Coronavirus.', 'The National Guard.', 'The other eighteen COVIDs.', 'Unnecessarily sensual Zoom messages.'], 'blackCards': ['America: #1 in _______!', "Doctor, I've been doing _______ lately and I fear that I may be very sick.", 'I cannot BELIEVE that the grocery store is sold out of _______ already!', 'We regret to inform you that _______ has officially been cancelled due to COVID-19.', 'What is the one good thing about this pandemic?', 'What was the most difficult thing to give up for social distancing?', "What's really to blame for the spread of the virus?", "What's the best way to kill time while trapped inside the house?", "_______ is the entire reason I'm still holding onto some sanity."]}}
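Since the question asks for a function that takes a deck URL and returns its white and black cards, here is a minimal sketch of how the JSON above might be used for that. The function names data_url_from_view_url and extract_cards are hypothetical, and the assumption that every deck's data lives at the same path with /view replaced by /get is only inferred from this one example; it has not been checked against other decks.

```python
from urllib.parse import urlsplit, urlunsplit


def data_url_from_view_url(view_url):
    """Map .../expansions/<id>/view to .../expansions/<id>/get.

    Assumption: other decks follow the same URL pattern as the one above.
    """
    parts = urlsplit(view_url)
    segments = parts.path.rstrip("/").split("/")
    if segments[-1] != "view":
        raise ValueError(f"expected a URL ending in /view, got {view_url!r}")
    segments[-1] = "get"
    return urlunsplit((parts.scheme, parts.netloc, "/".join(segments), "", ""))


def extract_cards(payload):
    """Return (white_cards, black_cards) from the decoded JSON response."""
    if not payload.get("success"):
        raise ValueError("response did not report success")
    expansion = payload["expansion"]
    white = list(expansion.get("whiteCards", []))
    black = list(expansion.get("blackCards", []))
    return white, black


# A trimmed-down payload with the same shape as the printed output above.
sample = {
    "success": True,
    "expansion": {
        "name": "Global Pandemic Pack",
        "whiteCards": ["Plague, Inc.", "The National Guard."],
        "blackCards": ["America: #1 in _______!"],
    },
}

white_cards, black_cards = extract_cards(sample)
print(len(white_cards), "white cards;", len(black_cards), "black cards")
print(data_url_from_view_url(
    "https://cardslackingoriginality.com/expansions/5e758e4034489b003f4529f6/view"
))
```

With a live response, the same functions would be combined as extract_cards(requests.get(data_url_from_view_url(URL)).json()), where URL is the /view address from the question.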