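I am scraping a page that contains several categories into a CSV file. The first category goes into one column correctly, but the data for the second column is never written to the CSV. The code I am using: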
import urllib2
import csv
from bs4 import BeautifulSoup

url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html"
page = urllib2.urlopen(url)
soup_jazz = BeautifulSoup(page)

all_years = soup_jazz.find_all("td", class_="views-field views-field-year")
all_category = soup_jazz.find_all("td", class_="views-field views-field-category-code")

with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for years in all_years:
        year_won = years.string
        if year_won:
            csv_writer.writerow([year_won.encode('utf-8')])
    for categories in all_category:
        category_won = categories.string
        if category_won:
            csv_writer.writerow([category_won.encode('utf-8')])
It writes the column header into the second column, but not category_won.
Following your suggestion, I changed it to:
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
But now I get the following error:
    csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
ValueError: I/O operation on closed file
Answer 0 (score: 0)
You can zip() the two lists:
for years, categories in zip(all_years, all_category):
    year_won = years.string
    category_won = categories.string
    if year_won and category_won:
        csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
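The ValueError in the question is what Python raises when a write happens after the with block has already closed the file, so the loop most likely sat outside that block. Below is a minimal sketch of both points, assuming Python 3 and using plain lists and in-memory buffers in place of the scraped cells and jazz.csv. The sample values and the write_rows helper are illustrative and not part of the original code.

```python
import csv
import io

# Stand-ins for the text of the scraped year and category cells.
years = ["2012", "2012", "1958"]
categories = [
    "Best Improvised Jazz Solo",
    "Best Jazz Vocal Album",
    "Best Jazz Performance, Group",
]


def write_rows(f, years, categories):
    """Write one CSV row per (year, category) pair to the open file f."""
    writer = csv.writer(f)
    writer.writerow(["Year Won", "Category"])
    # zip() pairs the n-th year with the n-th category, so each row
    # gets both columns instead of one value per row.
    for year, category in zip(years, categories):
        writer.writerow([year, category])


# Writing while the buffer is still open works as expected.
buffer = io.StringIO()
write_rows(buffer, years, categories)
output = buffer.getvalue()

# Writing after the buffer is closed fails, which mirrors a loop that
# runs after the with block has ended.
closed = io.StringIO()
closed.close()
try:
    csv.writer(closed).writerow(["2012", "Best Jazz Vocal Album"])
    error = None
except ValueError as exc:
    error = str(exc)
```

In the real script the same applies to the file object: keep the zip() loop indented under the with statement so every writerow call runs while the file is open.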
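Unfortunately, the HTML of that page is somewhat broken, so you cannot search by table row as you would expect.
The next best thing is to search for the year cells and then find the sibling category cell for each one: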
soup_jazz = BeautifulSoup(page)
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
        year = year_cell and year_cell.text.strip().encode('utf8')
        if not year:
            continue
        # Take the first following sibling that is a <td> with the category class;
        # the None default skips siblings without a tag name (plain text nodes).
        category = next((e for e in year_cell.next_siblings
                         if getattr(e, 'name', None) == 'td' and
                         'views-field-category-code' in e.attrs.get('class', [])),
                        None)
        category = category and category.text.strip().encode('utf8')
        if year and category:
            csv_writer.writerow([year, category])
This produces:
Year Won,Category
2012,Best Improvised Jazz Solo
2012,Best Jazz Vocal Album
2012,Best Jazz Instrumental Album
2012,Best Large Jazz Ensemble Album
....
1960,Best Jazz Composition Of More Than Five Minutes Duration
1959,Best Jazz Performance - Soloist
1959,Best Jazz Performance - Group
1958,"Best Jazz Performance, Individual"
1958,"Best Jazz Performance, Group"
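The quotes in the last two rows are added by the csv module because those category names contain a comma. Here is a small sketch, assuming Python 3 and an in-memory copy of a few of the rows above, showing that csv.reader turns each quoted field back into a single value:

```python
import csv
import io

# A few rows copied from the output above, including the quoted ones.
sample = (
    'Year Won,Category\r\n'
    '1959,Best Jazz Performance - Group\r\n'
    '1958,"Best Jazz Performance, Individual"\r\n'
    '1958,"Best Jazz Performance, Group"\r\n'
)

# newline="" leaves line endings untouched so the csv module handles them.
rows = list(csv.reader(io.StringIO(sample, newline="")))

for year, category in rows[1:]:
    print(year, "|", category)
```

Every row still has exactly two fields, so the quoting does not need to be removed by hand.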