Getting text from a web page as an iterable object in Python 3.3

Time: 2014-01-31 00:08:53

Tags: python html parsing beautifulsoup

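I am trying to get the text from a web page using Python 3.3 and then search that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy I need to save the text after each category in the card info (card text, rarity, etc.). Currently I am using Beautiful Soup, but get_text causes a UnicodeEncodeError and does not return an iterable object. Here is the relevant code: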

urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
    ).read()

htmlRaw = BeautifulSoup(urlStr)

htmlText = htmlRaw.get_text

for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"

1 answer:

Answer 0 (score: 0)

This is incorrect:

htmlText = htmlRaw.get_text

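Since get_text is a method of the BeautifulSoup class, you are assigning the method itself to htmlText, not its result. There is a property variant of it that does what you want here: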

htmlText = htmlRaw.text
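The difference between the method, its result, and the property can be seen in a short sketch. The HTML below is made up for illustration and is not the real Gatherer markup; the label-then-value line layout is likewise an assumption. Note that even the correct result is a plain string, so it iterates character by character and has no `.next()` method; splitting it into lines and using the built-in `next()` on an iterator is one way to get the "line after the label" behaviour the question is after.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page; the real markup differs.
html = """
<div>Converted Mana Cost:</div>
<div>3</div>
<div>Types:</div>
<div>Enchantment</div>
"""

soup = BeautifulSoup(html, "html.parser")

method = soup.get_text      # the bound method itself, not text
text = soup.get_text()      # calling the method returns a str
same_text = soup.text       # property form, equivalent to get_text()

# A str iterates one character at a time, so split it into non-empty lines.
lines = [line.strip() for line in text.splitlines() if line.strip()]

# Use an explicit iterator so the line following a label can be pulled
# with the built-in next(); str objects have no .next() method.
found = {}
line_iter = iter(lines)
for line in line_iter:
    if line in ("Converted Mana Cost:", "Types:"):
        found[line] = next(line_iter, "")

print(found)
```

This only works when each value sits on the line directly after its label, which is fragile for real pages.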

You are also using an HTML parser merely to strip tags, when you could use it to target the data you want:

# unique id for the html section containing the card info
card_id = 'ctl00_ctl00_ctl00_MainContent_SubContent_SubContent_rightCol'

# grab the html section with the card info
card_data = htmlRaw.find(id=card_id)

# create a generator to iterate over the rows
card_rows = ( row for row in card_data.find_all('div', 'row') )

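# create a generator to produce functions for retrieving the values
# (each lambda looks up `row` when it is called, so it must be used
# before the generator advances to the next row)
card_rows_getters = ( lambda x: row.find('div', x).text.strip() for row in card_rows )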

# create a generator to get the values
card_values = ( (get('label'), get('value')) for get in card_rows_getters )

# dump them into a dictionary
cards = dict( card_values )

print(cards)

Output, pretty-printed (the u'' prefixes appear only under Python 2):

{u'Artist:': u'Scott Chou',
 u'Card Name:': u'Dark Prophecy',
 u'Card Number:': u'93',
 u'Card Text:': u'Whenever a creature you control dies, you draw a card and lose 1 life.',
 u'Community Rating:': u'Community Rating: 3.617 / 5\xa0\xa0(64 votes)',
 u'Converted Mana Cost:': u'3',
 u'Expansion:': u'Magic 2014 Core Set',
 u'Flavor Text:': u'When the bog ran short on small animals, Ekri turned to the surrounding farmlands.',
 u'Mana Cost:': u'',
 u'Rarity:': u'Rare',
 u'Types:': u'Enchantment'}

Now you have a dictionary of the information you wanted (plus some extra), which will be much easier to work with.
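As a rough sketch of how the dictionary could feed the `message` string from the question: the dictionary literal below copies a few values from the output above, while `FIELDS` and `build_message` are hypothetical names introduced only for this example.

```python
# Values copied from the dictionary printed above; only some keys are shown.
cards = {
    "Converted Mana Cost:": "3",
    "Types:": "Enchantment",
    "Card Text:": "Whenever a creature you control dies, you draw a card and lose 1 life.",
    "Flavor Text:": "When the bog ran short on small animals, Ekri turned to the surrounding farmlands.",
    "Rarity:": "Rare",
}

# (dictionary key, label used in the message), in the question's order.
# FIELDS and build_message are hypothetical helpers for this sketch.
FIELDS = [
    ("Converted Mana Cost:", "Converted Mana Cost"),
    ("Types:", "Type"),
    ("Card Text:", "Rules Text"),
    ("Flavor Text:", "Flavor Text"),
    ("Rarity:", "Rarity"),
]


def build_message(card):
    """Format the selected fields the same way the question's loop does."""
    parts = []
    for key, label in FIELDS:
        value = card.get(key)
        if value:  # skip fields that are missing or empty
            parts.append("*{}: {}* \n\n".format(label, value))
    return "".join(parts)


message = build_message(cards)
print(message)
```

Looking fields up by key avoids depending on the order in which lines appear in the page text.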