Troubleshooting BeautifulSoup with a flat HTML hierarchy and a next_sibling loop

Time: 2015-05-21 17:31:53

Tags: python beautifulsoup

So I have HTML with a flat hierarchy:

<div class="caption">
  <strong>July 1</strong>
  <br>
  <em>Top Gun</em>
  <br>
  "Location: Millennium Park"
  <br>
  "Amenities: Please be a volleyball tournament..."
  <br>
  <em>Captain Phillips</em>
  <br>
  "Location: Montgomery Ward Park"
  <br>
  <br>
  <strong>July 2</strong>
  <br>
  <em>The Fantastic Mr. Fox </em>

I'm stumbling right at the start of my code. Am I using next_sibling incorrectly, or is something else wrong that keeps print title from returning anything? Thanks, everyone.

import csv
import re
from bs4 import BeautifulSoup
from urllib2 import urlopen

URL = 'http://www.thrillist.com/entertainment/chicago/free-outdoor-summer-movies-chicago'

html = urlopen(URL).read()
soup = BeautifulSoup(html)

root = soup.find_all("strong")
for row in root:
    sibling = row.next_sibling
    while sibling and sibling.name != "strong":
        if sibling.name == "em":
            title = sibling.text
        sibling = sibling.next_sibling
print title  # <---- still not getting the movie titles under the <em> tag

1 Answer:

Answer 0 (score: 2)

Explicitly setting the underlying parser to lxml (which needs to be installed) or html.parser solved the problem for me (as usual, all credit goes to @abarnert). Demo:

>>> from bs4 import BeautifulSoup
>>> from urllib2 import urlopen
>>> 
>>> URL = 'http://www.thrillist.com/entertainment/chicago/free-outdoor-summer-movies-chicago'
>>> html = urlopen(URL).read()
>>> len(BeautifulSoup(html, "html.parser").find_all('strong'))
81
>>> len(BeautifulSoup(html, "lxml").find_all('strong'))
81
>>> len(BeautifulSoup(html, "html5lib").find_all('strong'))
0

Note that if you don't explicitly specify a parser, BeautifulSoup picks one automatically:

  

  If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.

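My guess is that in your case html5lib was the parser chosen and, as the demo shows, it has trouble with this page and finds no strong tags at all, which is why you never see title printed.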
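To make the parser choice concrete, here is a minimal sketch of passing the parser name explicitly instead of relying on automatic selection. It assumes Python 3 and uses a small inline HTML string (a hypothetical stand-in for the downloaded page), so it does not reproduce the html5lib behavior seen on the real page; it only shows where the parser argument goes.

```python
from bs4 import BeautifulSoup

# Hypothetical inline snippet standing in for the downloaded page.
html = (
    "<div><strong>July 1</strong><br><em>Top Gun</em><br>"
    "<strong>July 2</strong></div>"
)

# The second argument names the parser, so the result does not depend
# on which optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html, "html.parser")

strong_count = len(soup.find_all("strong"))
print(strong_count)
```

Swapping "html.parser" for "lxml" or "html5lib" in the same call is how the counts in the demo above were compared; both of those need to be installed separately.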

Also, as @abarnert pointed out, you need to exit the inner loop as soon as you hit the next strong tag:

root = soup.find_all("strong")
for row in root:
    for sibling in row.next_siblings:
        if sibling.name == "strong":
            break
        if sibling.name == "em":
            print sibling.text

Prints:

A League of Their Own
It's a Mad, Mad, Mad, Mad World
Monsters University 
...
Cloudy with a Chance of Meatballs 2
Best in Show
Ironman 3
Sean Cooley is Thrillist's Chicago Editor and is still mad that Ben Affleck is the new Batman. Follow him @SeanCooley.
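As a follow-up, the same sibling walk can be sketched as a self-contained function that groups titles under their date instead of printing them. This is an illustration under stated assumptions, not the answer's exact code: it targets Python 3, uses a trimmed copy of the HTML from the question as sample data, and the names SAMPLE_HTML and titles_by_date are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the flat HTML from the question (sample data only).
SAMPLE_HTML = """
<div class="caption">
  <strong>July 1</strong>
  <br>
  <em>Top Gun</em>
  <br>
  Location: Millennium Park
  <br>
  <em>Captain Phillips</em>
  <br>
  Location: Montgomery Ward Park
  <br>
  <br>
  <strong>July 2</strong>
  <br>
  <em>The Fantastic Mr. Fox </em>
</div>
"""


def titles_by_date(html):
    """Map each <strong> date to the <em> titles that follow it."""
    soup = BeautifulSoup(html, "html.parser")  # explicit parser choice
    result = {}
    for strong in soup.find_all("strong"):
        titles = []
        for sibling in strong.next_siblings:
            # Text nodes have no tag name, so fall back to None.
            name = getattr(sibling, "name", None)
            if name == "strong":
                break  # the next date starts here
            if name == "em":
                titles.append(sibling.get_text(strip=True))
        result[strong.get_text(strip=True)] = titles
    return result


schedule = titles_by_date(SAMPLE_HTML)
for date, titles in schedule.items():
    print(date, titles)
```

Collecting into a dictionary keeps each title attached to its date, which the flat markup otherwise only implies through ordering. Note that any em tag after the last date, such as the author bio on the real page, would still be collected under that date.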