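I'm trying to get the text of a web page using Python 3.3 and then search that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy I need to save the text after each category in the card info (card text, rarity, etc.). Currently I'm using Beautiful Soup, but get_text causes a UnicodeEncodeError and doesn't return an iterable object. Here is the relevant code: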
urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
).read()
htmlRaw = BeautifulSoup(urlStr)
htmlText = htmlRaw.get_text
for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"
Answer 0 (score: 0)
This is incorrect:
htmlText = htmlRaw.get_text
Because get_text is a method of the BeautifulSoup class, you are assigning the method itself to htmlText, not its result. There is a property variant that does what you want here:
htmlText = htmlRaw.text
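To make the difference concrete, here is a minimal sketch that uses a stand-in class instead of BeautifulSoup itself, so it runs without any third-party package. `FakeSoup` is hypothetical and only mimics the `get_text()` / `.text` pair described above. It also shows a second thing to watch for: once you have the string, iterating over it directly yields single characters, so `splitlines()` is one way to get lines.

```python
class FakeSoup:
    """Hypothetical stand-in that mimics the get_text()/.text pair."""

    def __init__(self, content):
        self._content = content

    def get_text(self):
        return self._content

    @property
    def text(self):
        # Property variant: plain attribute access returns get_text()'s result.
        return self.get_text()


soup = FakeSoup("Rarity:\nRare\n")

method = soup.get_text      # the bound method itself, not a string
called = soup.get_text()    # the string result
via_property = soup.text    # same string, no parentheses needed

try:
    for line in method:     # what the question's loop effectively does
        pass
    loop_error = None
except TypeError as exc:
    loop_error = exc        # a bound method is not iterable

# Iterating a str yields single characters; splitlines() yields lines.
lines = called.splitlines()
```

Either `htmlRaw.get_text()` or `htmlRaw.text` gives you the string; the version without parentheses gives you only the method object.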
You're also using an HTML parser simply to strip tags, when you could use it to locate the data you want directly:
# unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'
# grab the html section with the card info
card_data = htmlRaw.find(id=card_id)
# create a generator to iterate over the rows
card_rows = ( row for row in card_data.find_all('div', 'row') )
# create a generator to produce functions for retrieving the values
card_rows_getters = ( lambda x: row.find('div', x).text.strip() for row in card_rows )
# create a generator to get the values
card_values = ( (get('label'), get('value')) for get in card_rows_getters )
# dump them into a dictionary
cards = dict( card_values )
print(cards)  # the u'...' prefixes below come from running this under Python 2
{u'Artist:': u'Scott Chou',
u'Card Name:': u'Dark Prophecy',
u'Card Number:': u'93',
u'Card Text:': u'Whenever a creature you control dies, you draw a card and lose 1 life.',
u'Community Rating:': u'Community Rating: 3.617 / 5\xa0\xa0(64 votes)',
u'Converted Mana Cost:': u'3',
u'Expansion:': u'Magic 2014 Core Set',
u'Flavor Text:': u'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
u'Mana Cost:': u'',
u'Rarity:': u'Rare',
u'Types:': u'Enchantment'}
Now you have a dictionary of the information you want (plus a few extras), which will be much easier to work with.
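As a rough sketch of that last step, the dictionary can be turned into the `message` string from the question. The `cards` literal below is copied from the printed output (trimmed to the relevant keys) so the snippet runs on its own. `FIELDS` and `build_message` are hypothetical names introduced for illustration, not part of the original code.

```python
# Sample data copied from the printed dictionary (subset of keys).
cards = {
    'Card Text:': 'Whenever a creature you control dies, you draw a card and lose 1 life.',
    'Converted Mana Cost:': '3',
    'Flavor Text:': 'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
    'Rarity:': 'Rare',
    'Types:': 'Enchantment',
}

# (key in the scraped dictionary, heading used in the message), in output order.
FIELDS = [
    ('Converted Mana Cost:', 'Converted Mana Cost'),
    ('Types:', 'Type'),
    ('Card Text:', 'Rules Text'),
    ('Flavor Text:', 'Flavor Text'),
    ('Rarity:', 'Rarity'),
]


def build_message(card):
    """Format the selected fields the way the question's loop intended."""
    parts = []
    for key, heading in FIELDS:
        value = card.get(key)
        if value:  # skip fields that are missing or empty on this card
            parts.append("*" + heading + ": " + value + "* \n\n")
    return "".join(parts)


message = build_message(cards)
print(message)
```

Looking values up by key avoids the line-by-line scan entirely, and `card.get(key)` lets cards without, say, flavor text pass through without raising a KeyError.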