Extracting Unicode data from a URL source with XPath in Python

Time: 2014-11-15 11:10:49

Tags: python xpath unicode lxml

I want to extract the Unicode text from

<div class="" id="messageContent">\xd8\xaf\xd8\xb1</div>

This is what I tried:

import requests
from lxml import html
post_data=...
post_response=requests.post(url='http://example.com/', data=post_data)
out=post_response.text
tree=html.fromstring(out)
print out.xpath('//div/[@id="messageContent"]/text()')

Update

Then I got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 1447, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:41728)
  File "xpath.pxi", line 321, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:117734)
  File "xpath.pxi", line 239, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:116911)
  File "xpath.pxi", line 225, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:116780)
lxml.etree.XPathEvalError: Invalid expression

The output I want for messageContent is:

\xd8\xaf\xd8\xb1

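2 answers:

Answer 0: (score: 1)

The error is quite clear: the variable out holds a unicode object, not an object that has an xpath attribute. You probably just mixed up out and tree: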

print out # will give you the whole text
print tree.xpath(...)  # will probably print what you were looking for

It has nothing to do with the "Unicode data" you are trying to extract.
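For what it's worth, here is a minimal sketch of the corrected call. The inline page string is a hypothetical stand-in for post_response.text, so nothing is fetched over the network. Note also that the predicate is attached directly to div: the question's //div/[@id=...] has an extra slash before the bracket, which as far as I can tell is what lxml reports as "Invalid expression".

```python
from lxml import html

# Hypothetical stand-in for post_response.text (an assumption, not the real page).
page = u'<html><body><div class="" id="messageContent">\u062f\u0631</div></body></html>'

# Parse the markup; tree is the element object that actually has .xpath().
tree = html.fromstring(page)

# The predicate follows the element name directly: div[...], not div/[...].
texts = tree.xpath('//div[@id="messageContent"]/text()')

print(texts)
```

The result is a list of text nodes, so texts[0] is the string inside the div.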

Answer 1: (score: 1)

You probably meant tree.xpath(...)
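If the goal is the escaped byte form shown in the question rather than the decoded characters, one option is to encode the extracted text yourself. This is only a sketch and it assumes the page is UTF-8; the text variable is a hypothetical stand-in for whatever tree.xpath(...) returned.

```python
# Hypothetical stand-in for the first item returned by tree.xpath(...).
text = u'\u062f\u0631'

# Encode the Unicode string to UTF-8 bytes (assumes the page uses UTF-8).
raw = text.encode('utf-8')

# repr() shows the escaped byte values instead of the rendered characters.
print(repr(raw))

# Decoding with the same codec gives the original string back.
assert raw.decode('utf-8') == text
```

Under that assumption the four bytes are \xd8\xaf\xd8\xb1, which matches the output the question asks for.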