Control-flow logic when parsing HTML with a peculiar flat hierarchy using BeautifulSoup

Time: 2015-05-21 23:07:02

Tags: python date dictionary beautifulsoup

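I am trying to compile a dict from this HTML, which has an unusually flat structure. The format is also peculiar in that it does not give a date for each movie title: the titles playing on a given day are listed under that date (some dates have one movie, others have several).

Here is a snippet of the HTML: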

<div class="caption">
    <strong>July 1</strong>
    <br>
    <em>Top Gun</em>
    <br>
    "Location: Millennium Park"
    <br>
    "Amenities: Please be a volleyball tournament..."
    <br>
    <em>Captain Phillips</em>
    <br>
    "Location: Montgomery Ward Park"
    <br>
    <br>
    <strong>July 2</strong>
    <br>
    <em>The Fantastic Mr. Fox </em>

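I have written about 80% of the code, but it only outputs the last movie listed under each date. So if several movies are siblings of the same <strong> tag (i.e. the "date"), the dictionary is evidently being overwritten.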
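A stripped-down sketch of that overwrite, using the titles from the snippet and a hypothetical helper named rows_written (written for Python 3 and independent of the scraper itself):

```python
def rows_written(titles_by_date):
    """Mimic the flawed flow: one dict per date, one write per date."""
    written = []
    for date, titles in titles_by_date:
        row = {"Date": date, "Movie Title": ""}
        for title in titles:
            row["Movie Title"] = title  # each title replaces the previous one
        written.append(row)             # reached only once per date
    return written


sample = [
    ("July 1", ["Top Gun", "Captain Phillips"]),
    ("July 2", ["The Fantastic Mr. Fox"]),
]
print(rows_written(sample))
```

Because the row is stored only after the inner loop finishes, "Top Gun" never reaches the output and "Captain Phillips" is the only July 1 entry.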

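What I want to do is find each date and, holding that date value fixed, look up and store the location / title / amenities values. If/when we hit another date ("<strong>") or another title ("<em>"), we write the dict we have built to the file; but if it was a title ("<em>"), we keep rolling with the same date value we started with.

Here is my code: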

import csv
import re
from bs4 import BeautifulSoup
from urllib2 import urlopen

URL = 'http://www.thrillist.com/entertainment/chicago/free-outdoor-summer-movies-chicago'

html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")

with open("MovieParks.tsv", "w") as f:
    categories = ['Location', 'Movie Title', 'Date', 'Amenities']
    writer = csv.DictWriter(f, delimiter = '\t', fieldnames = categories)
    writer.writeheader()

    root = soup.find_all("strong")
    for row in root:
        master_dict = {'Location':"", 'Movie Title':"", 'Date':"", 'Amenities':None}
        for sibling in list(row.next_siblings)[:-1]:
            Date = row.text.encode('utf-8')
            master_dict['Date'] = Date
            if sibling.name == "strong":                
                break
            if sibling.name == "em":
                MovieTitle = sibling.text.encode('utf-8')
                master_dict['Movie Title'] = MovieTitle
                if sibling.next_sibling == "em":
                    writer.writerow(master_dict)
                    break
                    sibling = sibling.next_sibling
            if 'Location:' in sibling:     
                Location = sibling.replace("Location: ","") + ", Chicago"
                master_dict['Location'] = Location.encode('utf-8')
            if 'Amenities:' in sibling:
                #not every item has Amenities listed
                Amenities = sibling.replace("Amenities: ","")
                master_dict['Amenities'] = Amenities.encode('utf-8')


        writer.writerow(master_dict)

print 'Done here'

My troublesome current output (it lists only the last movie under each date heading on the site):

Location    Movie Title Date    Amenities
Edgebrook Park, Chicago A League of Their Own   June 7  Family friendly activities and games. Also: crying is allowed.
Gage Park, Chicago  It's a Mad, Mad, Mad, Mad World June 9  Family friendly activities and games.
Commercial Club Playground, Chicago Despicable Me 2 June 12 Family friendly activities and games.

I can't help feeling that I'm not far off; I just can't work out the control-flow logic that's needed.

1 Answer:

Answer 0 (score: 1)

Now is the time to start refactoring. I refactored the logic that handles all the subsequent movies within one date into a separate function:

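def processMovies(em, date):
    # Start a fresh row for this movie; the date is inherited from the caller.
    # Relies on the module-level `writer` created in the main block.
    master_dict = {'Location':"", 'Movie Title':"", 'Date':"", 'Amenities':None}
    MovieTitle = em.text.encode('utf-8')
    master_dict['Movie Title'] = MovieTitle
    master_dict['Date'] = date

    for sibling in em.next_siblings:
        if sibling.name == "strong":
            # Next date heading: this movie is complete.
            writer.writerow(master_dict)
            return
        if sibling.name == "em":
            # Another movie under the same date: write this one, then recurse.
            writer.writerow(master_dict)
            processMovies(sibling, date)
            return
        if 'Location:' in sibling:
            Location = sibling.replace("Location: ","") + ", Chicago"
            master_dict['Location'] = Location.encode('utf-8')
        if 'Amenities:' in sibling:
            #not every item has Amenities listed
            Amenities = sibling.replace("Amenities: ","")
            master_dict['Amenities'] = Amenities.encode('utf-8')

    # Ran out of siblings (last movie in its container): write it as well.
    writer.writerow(master_dict)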

In the main code, you only need to find the first movie (the <em> tag) under each <strong>, then pass that movie to processMovies():

with open("MovieParks.tsv", "w") as f:
    categories = ['Location', 'Movie Title', 'Date', 'Amenities']
    writer = csv.DictWriter(f, delimiter = '\t', fieldnames = categories)
    writer.writeheader()

    root = soup.find_all("strong")
    for row in root:
        date = row.text.encode('utf-8')
        movie = row.find_next_sibling('em')
        if movie is not None:  # skip a <strong> that has no movie after it
            processMovies(movie, date)

The code above successfully wrote all the movies to the .tsv file for me.
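The same control flow can also be written without recursion, as a single pass that remembers the current date and flushes a row whenever a new <em> or <strong> appears. The following is only a sketch under some assumptions. It targets Python 3 (so there are no .encode() calls, unlike the Python 2 code above). The names flatten, group_movies, to_tsv and FIELDS are made up for illustration. Stripping the surrounding double quotes from text nodes is a guess based on how the snippet in the question is displayed.

```python
import csv
import io

FIELDS = ["Location", "Movie Title", "Date", "Amenities"]


def flatten(container):
    """Yield (tag_name, text) for each direct child of a parsed tag.

    Duck-typed: expects BeautifulSoup-style children, where text nodes
    have no tag name and tags offer get_text().
    """
    for child in container.children:
        name = getattr(child, "name", None)
        text = child.get_text() if name else str(child)
        # Assumption: quotes around text nodes are display noise, so drop them.
        yield name, text.strip().strip('"')


def group_movies(tokens):
    """Single pass: keep the current date, emit one row per movie."""
    rows, date, row = [], None, None
    for name, text in tokens:
        if name == "strong":            # new date heading
            if row:
                rows.append(row)
            date, row = text, None
        elif name == "em":              # new movie under the current date
            if row:
                rows.append(row)
            row = {"Location": "", "Movie Title": text,
                   "Date": date, "Amenities": None}
        elif row and text.startswith("Location:"):
            row["Location"] = text[len("Location:"):].strip() + ", Chicago"
        elif row and text.startswith("Amenities:"):
            row["Amenities"] = text[len("Amenities:"):].strip()
    if row:                             # the last movie has no tag after it
        rows.append(row)
    return rows


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, delimiter="\t", fieldnames=FIELDS,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With a parsed page, this would presumably be called as group_movies(flatten(div)) for each div returned by soup.find_all("div", class_="caption"). Keeping the grouping separate from both the parsing and the file writing means it can be checked against plain (name, text) tuples without fetching the page.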