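I'm trying to build a dict from this HTML, which has an unusually flat structure. The format is also awkward because the date is not given for each movie title: the movies playing on a given day are simply listed under that date (some dates have one movie, some have several).
Here is a snippet of the HTML: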
<div class="caption">
<strong>July 1</strong>
<br>
<em>Top Gun</em>
<br>
"Location: Millennium Park"
<br>
"Amenities: Please be a volleyball tournament..."
<br>
<em>Captain Phillips</em>
<br>
"Location: Montgomery Ward Park"
<br>
<br>
<strong>July 2</strong>
<br>
<em>The Fantastic Mr. Fox </em>
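I've written about 80% of the code, but it only outputs the last movie listed under each date. So when several movies are siblings of one <strong> (i.e. "date") tag, the dictionary obviously gets overwritten.
What I want to do is find each date and, holding that date value constant, walk through the following siblings and store the location / title / amenities values. When we hit another date (<strong>) or another title (<em>), we write the dictionary we have built to the file; but if it was a title (<em>), we keep going with the same date value we started with.
Here is my code: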
import csv
import re
from bs4 import BeautifulSoup
from urllib2 import urlopen

URL = 'http://www.thrillist.com/entertainment/chicago/free-outdoor-summer-movies-chicago'
html = urlopen(URL).read()
soup = BeautifulSoup(html, "lxml")

with open("MovieParks.tsv", "w") as f:
    categories = ['Location', 'Movie Title', 'Date', 'Amenities']
    writer = csv.DictWriter(f, delimiter = '\t', fieldnames = categories)
    writer.writeheader()
    root = soup.find_all("strong")
    for row in root:
        master_dict = {'Location':"", 'Movie Title':"", 'Date':"", 'Amenities':None}
        for sibling in list(row.next_siblings)[:-1]:
            Date = row.text.encode('utf-8')
            master_dict['Date'] = Date
            if sibling.name == "strong":
                break
            if sibling.name == "em":
                MovieTitle = sibling.text.encode('utf-8')
                master_dict['Movie Title'] = MovieTitle
                if sibling.next_sibling == "em":
                    writer.writerow(master_dict)
                    break
            sibling = sibling.next_sibling
            if 'Location:' in sibling:
                Location = sibling.replace("Location: ","") + ", Chicago"
                master_dict['Location'] = Location.encode('utf-8')
            if 'Amenities:' in sibling:
                #not every item has Amenities listed
                Amenities = sibling.replace("Amenities: ","")
                master_dict['Amenities'] = Amenities.encode('utf-8')
        writer.writerow(master_dict)
print 'Done here'
My current, problematic output (it only lists the information for the last movie under each date heading on the site):
Location                              Movie Title                       Date     Amenities
Edgebrook Park, Chicago               A League of Their Own             June 7   Family friendly activities and games. Also: crying is allowed.
Gage Park, Chicago                    It's a Mad, Mad, Mad, Mad World   June 9   Family friendly activities and games.
Commercial Club Playground, Chicago   Despicable Me 2                   June 12  Family friendly activities and games.
etc.
I can't help feeling that I'm not far off; I just can't work out the control flow logic that's needed.
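Answer 0 (score: 1)
Now is a good time to start refactoring. I would move the logic that handles all the subsequent movies within one date into a separate function: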
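def processMovies(em, date):
    # Handles one movie, then recurses into the next <em> under the same date.
    # Relies on the module-level `writer` created in the main code.
    master_dict = {'Location':"", 'Movie Title':"", 'Date':"", 'Amenities':None}
    MovieTitle = em.text.encode('utf-8')
    master_dict['Movie Title'] = MovieTitle
    master_dict['Date'] = date
    for sibling in em.next_siblings:
        if sibling.name == "strong":
            # next date reached: this movie is complete
            writer.writerow(master_dict)
            return
        if sibling.name == "em":
            # next movie under the same date: write this one, then recurse
            writer.writerow(master_dict)
            processMovies(sibling, date)
            return
        if 'Location:' in sibling:
            Location = sibling.replace("Location: ","") + ", Chicago"
            master_dict['Location'] = Location.encode('utf-8')
        if 'Amenities:' in sibling:
            #not every item has Amenities listed
            Amenities = sibling.replace("Amenities: ","")
            master_dict['Amenities'] = Amenities.encode('utf-8')
    # no further <strong> or <em> follows: write the final movie rather than dropping it
    writer.writerow(master_dict)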
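As an aside, the rule the function implements (write the pending movie whenever a new <em> or <strong> shows up, and once more at the very end) can also be expressed as a single pass without recursion. The following is only a rough sketch under several assumptions, not the code from this answer: it targets Python 3, it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and the class name MovieParser, the SAMPLE markup and the amenities text in it are made up for illustration.

```python
from html.parser import HTMLParser


class MovieParser(HTMLParser):
    """Hypothetical helper: one dict per movie, carrying the last date forward."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished movie records
        self.date = ""      # text of the most recent <strong>
        self.movie = None   # record currently being filled, if any
        self.tag = None     # "strong" or "em" while inside one of them

    def flush(self):
        # Emit the pending movie (if any) before a new one starts.
        if self.movie is not None:
            self.rows.append(self.movie)
            self.movie = None

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "em"):
            self.tag = tag

    def handle_endtag(self, tag):
        if tag in ("strong", "em"):
            self.tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.tag == "strong":
            # New date: close the previous movie and remember the date.
            self.flush()
            self.date = text
        elif self.tag == "em":
            # New title: close the previous movie, keep the current date.
            self.flush()
            self.movie = {"Location": "", "Movie Title": text,
                          "Date": self.date, "Amenities": None}
        elif self.movie is not None:
            if text.startswith("Location:"):
                place = text[len("Location:"):].strip()
                self.movie["Location"] = place + ", Chicago"
            elif text.startswith("Amenities:"):
                self.movie["Amenities"] = text[len("Amenities:"):].strip()

    def close(self):
        super().close()
        self.flush()  # do not lose the final movie


# Made-up sample in the same flat shape as the snippet in the question.
SAMPLE = """
<div class="caption">
<strong>July 1</strong><br>
<em>Top Gun</em><br>
Location: Millennium Park<br>
Amenities: sample amenities text<br>
<em>Captain Phillips</em><br>
Location: Montgomery Ward Park<br>
<br>
<strong>July 2</strong><br>
<em>The Fantastic Mr. Fox </em>
</div>
"""

parser = MovieParser()
parser.feed(SAMPLE)
parser.close()
rows = parser.rows
for row in rows:
    print(row)
```

Keeping the date and the pending movie as parser state means no record is shared between movies, which is what caused the overwriting in the question.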
In the main code, you only need to find the first movie (the first <em> tag) after each <strong> and pass it to processMovies():
with open("MovieParks.tsv", "w") as f:
    categories = ['Location', 'Movie Title', 'Date', 'Amenities']
    writer = csv.DictWriter(f, delimiter = '\t', fieldnames = categories)
    writer.writeheader()
    root = soup.find_all("strong")
    for row in root:
        date = row.text.encode('utf-8')
        # first movie listed under this date
        movie = row.find_next_sibling('em')
        # skip any <strong> that has no movie after it
        if movie is not None:
            processMovies(movie, date)
The code above successfully wrote all the movies to the .tsv file for me.
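One note for anyone running this under Python 3 rather than the Python 2 used in this thread (an assumption about the reader's setup): the .encode('utf-8') calls should be dropped, because the csv module would otherwise write the text form of each bytes value, such as b'Top Gun', into the file. A minimal sketch of the writing step with plain str values, using an in-memory buffer in place of the real file and a few rows copied from the question for illustration:

```python
import csv
import io

categories = ["Location", "Movie Title", "Date", "Amenities"]

# Illustrative rows taken from the question; values are str, not bytes.
rows = [
    {"Location": "Edgebrook Park, Chicago",
     "Movie Title": "A League of Their Own",
     "Date": "June 7",
     "Amenities": "Family friendly activities and games. Also: crying is allowed."},
    {"Location": "Montgomery Ward Park, Chicago",
     "Movie Title": "Captain Phillips",
     "Date": "July 1",
     "Amenities": None},  # not every movie lists amenities
]

# StringIO stands in for:
#   open("MovieParks.tsv", "w", newline="", encoding="utf-8")
buf = io.StringIO()
writer = csv.DictWriter(buf, delimiter="\t", fieldnames=categories,
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

tsv = buf.getvalue()
print(tsv)
```

A None value is written as an empty field, so movies without amenities simply end with a trailing tab.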