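I want to know how to use lxml to fetch a URL so that I can then use XPath to extract the data I need.
Please point me in the right direction. Thanks a lot.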
import requests
from lxml.html import parse

res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = parse(res.content)
name = doc.xpath("//meta[@itemprop='name']/@content")
print name
My code fails with the following error:
doc = parse(res.content)
File "/Users/ome/djangoenv/lib/python2.7/site-packages/lxml/html/__init__.py", line 786, in parse
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72655)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:106263)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106564)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105561)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100456)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94543)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96003)
File "parser.pxi", line 618, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95015)
IOError
Answer 0 (score: 2)
res.content is a string holding the HTML of the response body, not a filename or URL. To parse it, use lxml.html.fromstring():
import lxml.html
import requests
res = requests.get('http://www.ipeen.com.tw/comment/778246')
doc = lxml.html.fromstring(res.content)
name = doc.xpath(".//meta[@itemprop='name']/@content")
print name
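To try the same approach without a network request, here is a minimal sketch that feeds fromstring() a small in-memory page. The HTML snippet and the name "Sample Restaurant" are made up for illustration and stand in for res.content; the code uses print() as a function so it also runs on Python 3.

```python
import lxml.html

# Hypothetical stand-in for res.content: the raw bytes of an HTML page.
html_bytes = (
    b"<html><head>"
    b"<meta itemprop='name' content='Sample Restaurant'>"
    b"</head><body><p>Review text</p></body></html>"
)

# fromstring() parses markup that is already in memory.
doc = lxml.html.fromstring(html_bytes)

# An XPath expression ending in an attribute step returns a list of strings.
names = doc.xpath(".//meta[@itemprop='name']/@content")
print(names)
```

Because xpath() returns a list, take names[0] if you only want the first match.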
Answer 1 (score: 0)
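Presumably res.content is a string containing the page content. parse takes a filename, a URL, or a file-like object, so you are passing the page content as if it were the name of a file, which is probably not what you want. To build a tree from a string, use fromstring instead of parse.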
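If you would rather keep using parse, one option is to wrap the bytes in a file-like object first. The sketch below assumes a made-up HTML snippet in place of the real page content; the variable names are illustrative only.

```python
from io import BytesIO

import lxml.html

# Hypothetical stand-in for the downloaded page content.
page_bytes = (
    b"<html><head>"
    b"<meta itemprop='name' content='Sample Restaurant'>"
    b"</head><body><p>Review text</p></body></html>"
)

# parse() accepts a file-like object, so wrap the bytes in BytesIO.
tree = lxml.html.parse(BytesIO(page_bytes))

# parse() returns an ElementTree; getroot() gives the <html> element.
root = tree.getroot()
names_from_tree = root.xpath("//meta[@itemprop='name']/@content")
print(names_from_tree)
```

Note the difference in return types: parse gives you an ElementTree, while fromstring returns the root element directly.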