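I'm trying to scrape an Atom-based RSS feed with Beautiful Soup, but it's proving difficult. Capturing the data goes fine until an <item> comes along that breaks the code and crashes the script. Those <item>s always contain entities (Firefox highlights them in orange) such as "&lt;" or "&quot;", while the <item>s without them work fine. I've tried a few things, such as BeautifulStoneSoup, stripping the special characters with a regular expression, and setting the "xml" parameter, but nothing works; usually they just throw a warning about being deprecated in BS4.
Why do these characters appear, and how can I handle them effectively?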
Here is one of the pages I'm trying to scrape: http://www.thestar.com/feeds.articles.news.gta.rss
Here is my code:
news_url = "http://www.thestar.com/feeds.articles.news.gta.rss" # Toronto Star RSS Feed
try:
    news_rss = urllib2.urlopen(news_url)
    news = news_rss.read()
    news_rss.close()
    soup = BeautifulSoup(news)
except:
    return "error"
titles = soup.findAll('title')
links = soup.findAll('link')
for link in links:
    link = link.contents # I want the url without the <link> tags
news_stuff = []
for item in titles:
    if item.text == "TORONTO STAR | NEWS | GTA": # These have <title> tags and I don't want them; just skip 'em.
        pass
    else:
        news_stuff.append((item.text, links[i])) # Here's a news story. Grab it.
i = 0
for thing in news_stuff:
    print '<a href="'
    print thing[1]
    print '"target="_blank">'
    print thing[0]
    print '</a><br/>'
    i += 1
Answer 0 (score: 2)
I'm not sure which problem you're referring to, but I got this error when I ran your code:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position 54: ordinal not in range(128)
To fix it, I just added an encoding step:
for thing in news_stuff:
    print '<a href="'
    print thing[1]
    print '"target="_blank">'
    print thing[0].encode("utf-8")
    print '</a><br/>'
    i += 1
With that change, the script ran without any errors.
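To illustrate why the explicit encoding helps, here is a minimal sketch. It assumes the failure comes from a title containing a curly quote (U+2018), as the error message above suggests; the sample title is borrowed from the output later in this thread and the variable names are only illustrative. Python 2 tends to fall back to the ASCII codec when printing unicode text to a stream with no declared encoding, and ASCII cannot represent that character.

```python
# -*- coding: utf-8 -*-
# Sample title containing U+2018 and U+2019 (curly single quotes).
title = u"In Texas, Toronto music leaders urge city hall to say \u2018yes\u2019"

# Encoding to ASCII fails because the curly quotes are outside its range.
try:
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 works for any valid unicode text.
utf8_bytes = title.encode("utf-8")

print(ascii_ok)
print(utf8_bytes.decode("utf-8") == title)
```

The same idea applies to the link, should a URL ever contain non-ASCII characters.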
Answer 1 (score: 1)
Here is what I tried; it did not crash.
from string import punctuation, whitespace
import urllib2
import datetime
import re
import MySQLdb
import csv
from bs4 import BeautifulSoup as Soup
news_url = "http://www.thestar.com/feeds.articles.news.gta.rss" # Toronto Star RSS Feed
news_rss = urllib2.urlopen(news_url)
news = news_rss.read()
news_rss.close()
soup = Soup(news)
titles = soup.findAll('title')
links = soup.findAll('link')
for link in links:
    link = link.contents # I want the url without the <link> tags
i=0
news_stuff = []
for item in titles:
    if item.text == "TORONTO STAR | NEWS | GTA": # These have <title> tags and I don't want them; just skip 'em.
        pass
    else:
        news_stuff.append((item.text, links[i])) # Here's a news story. Grab it.
i = 0
for thing in news_stuff:
    print '<a href="'
    print thing[1]
    print '"target="_blank">'
    print thing[0]
    print '</a><br/>'
    i += 1
Here is the output I got:
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
TTC argues for return of special constables
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Health information of 18,000 people stolen in Peel Region
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Fire closes Bathurst St. south of Dupont
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Empty tanker train cars derail in Brampton
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Medical illustration studios flourish in Toronto
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
In Texas, Toronto music leaders urge city hall to say ‘yes’
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Making sense of the Sammy Yatim shooting: Fiorito
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Toronto’s chief planner, Jennifer Keesmaat, challenges Mirvish/Gehry scheme: Hume
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Westbound Gardiner lanes reopen after rollover near Spadina
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Daycare Crisis: Halton health complaints show gaps in unlicensed care
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Witness describes shooting details as man confronted police near van
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Muslim AIDS activist honoured for taboo-busting work
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Death to death with dignity: DiManno
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Rockers join forces in Line 9 protest
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Could you eat 10 pizzas in 12 minutes? This guy did
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Former participants speak up about gay healing program
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Freed Canadians Tarek Loubani and John Greyson awaiting papers to come home from Egypt
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Man dies after crash at Finch and Dufferin
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Nuit Blanche lights up Toronto Saturday night
</a><br/>
<a href="
<link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
"target="_blank">
Leafs fans celebrate home opener at Maple Leaf Square
</a><br/>
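Note that every story in the output above is paired with the same feed-level link: `i` stays at 0 while the titles are collected, so `links[0]` is used each time. One possible way around this is to walk the feed item by item and read each item's own title and link. The sketch below swaps in the standard library's `xml.etree.ElementTree` in place of Beautiful Soup so that it runs without extra dependencies. `SAMPLE_FEED`, `extract_stories`, and the `example.com` URLs are hypothetical stand-ins, not the real Toronto Star data. It also shows how entities such as "&lt;" and "&quot;" can end up in a feed: a description may carry HTML that has been escaped so it can travel as plain text inside XML, and `html.unescape` decodes whatever is left after parsing.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an RSS 2.0 feed (not the real feed contents).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>TORONTO STAR | NEWS | GTA</title>
    <link>http://www.thestar.com/feeds.articles.news.gta.rss</link>
    <item>
      <title>TTC argues for return of special constables</title>
      <link>http://example.com/story-1</link>
      <description>&lt;p&gt;He said &amp;quot;yes&amp;quot;&lt;/p&gt;</description>
    </item>
    <item>
      <title>Fire closes Bathurst St. south of Dupont</title>
      <link>http://example.com/story-2</link>
      <description>Plain text summary</description>
    </item>
  </channel>
</rss>"""


def extract_stories(feed_xml):
    """Return a list of (title, link, description) tuples, one per <item>."""
    root = ET.fromstring(feed_xml)
    stories = []
    # Iterating over <item> elements skips the channel-level <title> and <link>.
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        description = item.findtext("description", default="")
        stories.append((title, link, description))
    return stories


for title, link, _description in extract_stories(SAMPLE_FEED):
    # Escape the values again before placing them into HTML output.
    print('<a href="%s" target="_blank">%s</a><br/>'
          % (html.escape(link, quote=True), html.escape(title)))
```

Because the loop only visits `<item>` elements, the special case for the "TORONTO STAR | NEWS | GTA" title is no longer needed. With Beautiful Soup, a similar per-item loop should be possible by calling `find` on each item rather than indexing two separate lists.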