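I am trying to get the text of a web page using Python 3.3 and then search that text for certain strings. When I find a matching string, I need to save the text that follows it. For example, take this page: http://gatherer.wizards.com/Pages/Card/Details.aspx?name=Dark%20Prophecy I need to save the text after each category in the card info (card text, rarity, etc.). Currently I am using Beautiful Soup, but get_text causes a UnicodeEncodeError and does not return an iterable object. Here is the relevant code: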
urlStr = urllib.request.urlopen(
    'http://gatherer.wizards.com/Pages/Card/Details.aspx?name=' + cardName
).read()
htmlRaw = BeautifulSoup(urlStr)
htmlText = htmlRaw.get_text
for line in htmlText:
    line = line.strip()
    if "Converted Mana Cost:" in line:
        cmc = line.next()
        message += "*Converted Mana Cost: " + cmc +"* \n\n"
    elif "Types:" in line:
        type = line.next()
        message += "*Type: " + type +"* \n\n"
    elif "Card Text:" in line:
        rulesText = line.next()
        message += "*Rules Text: " + rulesText +"* \n\n"
    elif "Flavor Text:" in line:
        flavor = line.next()
        message += "*Flavor Text: " + flavor +"* \n\n"
    elif "Rarity:" in line:
        rarity == line.next()
        message += "*Rarity: " + rarity +"* \n\n"
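Answer 0 (score: 0)
I'm not familiar with BeautifulSoup anymore, but I ran the code below. It is not meant as a complete answer, just something to point you in the right direction: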
# Python 2 session; in Python 3, urlopen lives in urllib.request
import urllib
from lxml import html
mypage = urllib.urlopen('http://gatherer.wizards.com/Pages/Card/Details.aspx?multiverseid=264')
dir(mypage)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'next', 'read', 'readline', 'readlines', 'url']
page = mypage.readlines()
len(page)
526
page[0]
'<?xml version="1.0" encoding="utf-8" ?>\r\n'
string = ''.join([apage for apage in page])
tree = html.fromstring(string)
elements = [e for e in tree.iter()]
for e in elements:
    if 'cardtextbox' in e.values():
        e, e.text_content()
(<Element div at 0x31a7ba0>, 'Enchant creature')
(<Element div at 0x31a7bf8>, "Enchanted creature has protection from red. This effect doesn't remove Red Ward.")
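I clearly don't know what I'm doing, but I'm muddling through it.
It looks to me like the values you are trying to identify are values in each element's attribute dictionary, so that much I do know. Listing all of the attributes you want to identify will take some more thought, but I think this should get you started.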
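Here is a rough Python 3 sketch of the same idea, matching elements by their attribute values and keeping the text that follows each category. It uses only the standard library's html.parser, so it is a different technique from the lxml session above. The class names "label" and "value", the sample markup, and the names CardInfoParser and parse_card_fields are all assumptions made for illustration; check them against the real page source before relying on this.

```python
from html.parser import HTMLParser

# Hypothetical markup: the "label"/"value" class names and the Rarity value
# are placeholders, not taken from the real page.
SAMPLE_HTML = """
<div class="row"><div class="label">Card Text:</div>
<div class="value"><div class="cardtextbox">Enchant creature</div>
<div class="cardtextbox">Enchanted creature has protection from red.</div></div></div>
<div class="row"><div class="label">Rarity:</div>
<div class="value">Uncommon</div></div>
"""


class CardInfoParser(HTMLParser):
    """Pair each <div class="label"> with the <div class="value"> after it."""

    def __init__(self):
        super().__init__()
        self.fields = {}         # label text -> value text
        self._kind = None        # "label" or "value" while capturing a div
        self._depth = 0          # div nesting depth inside the captured div
        self._chunks = []        # text pieces seen inside the captured div
        self._last_label = None  # most recent label still waiting for a value

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._kind is not None:
            self._depth += 1     # nested div inside a label/value div
            return
        classes = (dict(attrs).get("class") or "").split()
        for kind in ("label", "value"):
            if kind in classes:
                self._kind, self._depth, self._chunks = kind, 1, []

    def handle_endtag(self, tag):
        if tag != "div" or self._kind is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            text = " ".join(" ".join(self._chunks).split())
            if self._kind == "label":
                self._last_label = text.rstrip(":")
            elif self._last_label is not None:
                self.fields[self._last_label] = text
                self._last_label = None
            self._kind = None

    def handle_data(self, data):
        if self._kind is not None:
            self._chunks.append(data)


def parse_card_fields(html_text):
    """Return a dict mapping category names to the text that follows them."""
    parser = CardInfoParser()
    parser.feed(html_text)
    return parser.fields


fields = parse_card_fields(SAMPLE_HTML)
message = "".join("*{}: {}* \n\n".format(k, v) for k, v in fields.items())
```

To run it against the live page, you would pass it decoded text instead of SAMPLE_HTML, for example urllib.request.urlopen(url).read().decode('utf-8'); the page's first line declares utf-8, as shown in the session above.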
Answer 1 (score: 0)