Question

Suppose I have plain text in an HTML-like format, such as:

<div id="foo"><p id="bar">Some random text</p></div>

I need to run XPath on it to retrieve some of the inner elements. How can I convert the plain text into some kind of object that supports XPath?

Answer 1

You can use a plain Selector and run the same XPath or CSS queries on it directly:

from scrapy import Selector

...

sel = Selector(text='<div id="foo"><p id="bar">Some random text</p></div>')
selected_xpath = sel.xpath('//div[@id="foo"]')

Answer 2

You can pass the sample HTML as a string to lxml.html and query it with XPath:

from lxml import html

code = """<div id="foo"><p id="bar">Some random text</p></div>"""
source = html.fromstring(code)
source.xpath('//div/p/text()')
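
Note that with lxml, xpath() returns a plain Python list, so the result usually needs indexing or an emptiness check. A minimal sketch of that, with illustrative variable names:

```python
from lxml import html

code = '<div id="foo"><p id="bar">Some random text</p></div>'
source = html.fromstring(code)

# A text() query yields a list of strings (empty if nothing matches).
texts = source.xpath('//div/p/text()')
first_text = texts[0] if texts else None

# An element query yields a list of element objects;
# attributes are read with .get().
paragraphs = source.xpath('//p[@id="bar"]')
bar_id = paragraphs[0].get('id') if paragraphs else None

print(first_text, bar_id)
```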

Answer 3

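Andersson has already posted a solution to my question. Here is a second approach I just found. It also works well and uses Scrapy's own classes, so all the methods Scrapy users already know (extract(), extract_first(), and so on) are available.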

from scrapy.http import HtmlResponse

text = """<div id="foo"><p id="bar">Some random text</p></div>"""
# First, encode the text to bytes
text_encoded = text.encode('utf-8')
# Now wrap it in an HtmlResponse object (the URL is only a placeholder)
text_in_html = HtmlResponse(url='some url', body=text_encoded, encoding='utf-8')
# Now XPath works as if the text were an ordinary HTML response
text_in_html.xpath('//p/text()').extract_first()

Scrapy: how do I convert a string into an object that I can run XPath on?