This question is similar to this one. I have read the answers, but none worked for me. I am trying to get the information from the bluish box on this site.
This is what I wrote:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1'
req = requests.get(url)
soup = BeautifulSoup(req.text,'html5lib')
soup = soup.find('div', class_='game-header-body')
print(soup.prettify())
I get this error: AttributeError: 'NoneType' object has no attribute 'prettify'. The reason is that find() cannot locate the 'game-header-body' div, so soup becomes None. When I remove the soup = soup.find('div', class_='game-header-body') line, I can see all the HTML code except the div I am interested in.
I have read that it may be better to switch to the 'html5lib' parser library. I installed it with pip3 install html5lib (I am using Python 3.4.3), but I still get the same error. What should I do?
Answer 0 (score: 1)
The element game-header-body is not present in the HTML source; it is rendered later by JavaScript. You need something like Selenium to handle this. It can drive the browser of your choice (including a headless one if needed), which then executes the JavaScript for you. You can then access the resulting HTML after the page has fully loaded and parse it using BeautifulSoup.
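To see why the original code fails, it may help to reproduce the error without any network access. The sketch below uses a made-up HTML string (`static_html` is an assumption standing in for what `requests` receives, not the site's real markup) and shows that `find()` returns `None` when nothing matches, which is what makes the later `.prettify()` call raise `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the static HTML that requests receives:
# the container is empty because JavaScript has not filled it in yet.
static_html = "<html><body><div id='mainbody'></div></body></html>"

soup = BeautifulSoup(static_html, "html.parser")
header = soup.find("div", class_="game-header-body")

if header is None:
    # find() returns None when nothing matches, so calling
    # .prettify() on the result would raise AttributeError.
    message = "game-header-body is not in the static HTML"
else:
    message = header.prettify()

print(message)
```

If the same check against `req.text` from the question takes the `None` branch, switching parsers will not help, because the markup was never in the response; a browser-driven tool such as Selenium is then the likely way forward.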
The following would be an example of how this could be done using an already installed Firefox browser:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1'
# Start the locally installed Firefox; Selenium locates the default binary itself
browser = webdriver.Firefox()
browser.get(url)
# page_source holds the HTML after the browser has run the page's JavaScript
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()
# Print every matching div, with a separator line between them
for div in soup.find_all('div', class_='game-header-body'):
    print(div.prettify())
    print("----------------")
Note that there are multiple game-header-body divs, so this displays all of them.
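The headless option mentioned above, combined with an explicit wait so that the HTML is read only after the div exists, could look roughly like the following. This is a sketch under several assumptions: it uses the Selenium 4 API with Firefox and geckodriver available, and the helper names `fetch_rendered_html` and `find_divs` as well as the 10-second timeout are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_class, timeout=10):
    """Load url in headless Firefox and return the HTML once css_class appears."""
    # Selenium is imported here so find_divs() below works without it installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run without a visible browser window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Block until at least one matching element exists;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, css_class))
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on timeout


def find_divs(html, css_class):
    """Return every <div> carrying css_class from an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("div", class_=css_class)
```

Calling `find_divs(fetch_rendered_html(url, 'game-header-body'), 'game-header-body')` with the `url` from the example above should return the same divs, which can then be printed with `prettify()` as before.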