bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

Time: 2017-06-15 10:12:30

Tags: python html5 python-3.x parsing beautifulsoup

This question is similar to this one. I have read the answers, but none of them worked for me. I am trying to get the information from the bluish box on this site.

This is what I wrote:

import requests
from bs4 import BeautifulSoup
import re

url = 'https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1'

req = requests.get(url)
soup = BeautifulSoup(req.text,'html5lib')
soup = soup.find('div', class_='game-header-body')

print(soup.prettify())

I get this error: AttributeError: 'NoneType' object has no attribute 'prettify'. The reason is that find() cannot locate a div with the class 'game-header-body', so it returns None. When I remove the soup = soup.find('div', class_='game-header-body') line, I can see all the HTML code except the div I am interested in.
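To confirm that the div is really missing from the downloaded HTML, a guard along these lines can help. This is a minimal sketch: find_div_or_none and static_html are hypothetical names, and the sample markup is only a stand-in for the real response body.

```python
from bs4 import BeautifulSoup


def find_div_or_none(html, class_name):
    """Return the first <div> with the given class, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", class_=class_name)


# Hypothetical stand-in for a response body that lacks the target element.
static_html = "<html><body><div class='other'>placeholder</div></body></html>"

header = find_div_or_none(static_html, "game-header-body")
if header is None:
    # Checking for None avoids the AttributeError raised by None.prettify().
    print("game-header-body is not in the downloaded HTML")
else:
    print(header.prettify())
```

If the message is printed for the real req.text as well, the element is not in the static page at all, whichever parser is used.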

I have read that it might be better to switch to the 'html5lib' parser library. I installed it with pip3 install html5lib (I am using Python 3.4.3), but I still get the aforementioned error. What should I do?
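To rule out a parser installation problem (the FeatureNotFound error in the title), one option is to try each parser in turn and fall back to the built-in one. This is a sketch under assumptions: make_soup and its default parser order are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, parsers=("html5lib", "lxml", "html.parser")):
    """Parse html with the first available parser; return (soup, parser_name)."""
    for parser in parsers:
        try:
            return BeautifulSoup(html, parser), parser
        except FeatureNotFound:
            # This parser is not installed for the running interpreter.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup, used = make_soup("<p>hello</p>")
print("parsed with:", used)
```

If html5lib was installed with pip3 but the script reports a different parser, the script is probably running under another interpreter than the one pip3 installed into.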

1 answer:

Answer 0 (score: 1)

The element game-header-body is not present in the HTML source; it is rendered later by JavaScript. You need something like Selenium to help with this. It can drive the browser of your choice (including a headless one if needed), which then executes the JavaScript for you. You can then access the resulting HTML after the page has fully loaded and parse it using BeautifulSoup.

The following would be an example of how this could be done using an already installed Firefox browser:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1'

# Start the installed Firefox, let it run the page's JavaScript,
# then hand the rendered HTML to BeautifulSoup.
browser = webdriver.Firefox()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()

for div in soup.find_all('div', class_='game-header-body'):
    print(div.prettify())
    print("----------------")

Note that the page contains multiple game-header-body divs, so this prints all of them.
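A headless variant that waits explicitly for the div before reading the page source might look like the sketch below. The function names, the 10-second timeout and the lazy Selenium import are assumptions made for illustration, and it presumes Firefox plus a matching driver are available on the machine.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_class, timeout=10):
    """Load url in headless Firefox and return the HTML once css_class appears."""
    # Imported here so extract_divs() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Block until JavaScript has inserted at least one matching element.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, css_class))
        )
        return browser.page_source
    finally:
        browser.quit()  # always close the browser, even on timeout


def extract_divs(html, css_class):
    """Return every <div> carrying css_class in the given HTML string."""
    return BeautifulSoup(html, "html.parser").find_all("div", class_=css_class)


def main():
    url = "https://boardgamegeek.com/boardgame/161936/pandemic-legacy-season-1"
    html = fetch_rendered_html(url, "game-header-body")
    for div in extract_divs(html, "game-header-body"):
        print(div.prettify())
        print("----------------")
```

Calling main() runs the whole flow. The explicit wait avoids reading page_source before the script-generated div exists, and the try/finally ensures the browser is closed if the wait times out.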