Question

How do I retrieve all of the HTML contained inside a tag?

hxs = HtmlXPathSelector(response)
element = hxs.select('//span[@class="title"]/')
perhaps = hxs.select('//span[@class="title"]/html()')
html_of_tag = ?

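Edit: When I look at the documentation, I only see methods that return a new XPathSelectorList, or that return just the raw text inside a tag. I don't want a new list or just the text; I want the HTML source inside the tag. For example: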

<html>
    <head>
        <title></title>
    </head>
    <body>
        <div id="leexample">
            justtext
            <p class="ihatelookingforfeatures">
                sometext
            </p>
            <p class="yahc">
                sometext
            </p>
        </div>
        <div id="lenot">
            blabla
        </div>
    an awfuly long example for this.
    </body>
</html>

I'd like something like hxs.select('//div[@id="leexample"]/html()') that returns the HTML inside the element, like this:

justtext
<p class="ihatelookingforfeatures">
    sometext
</p>
<p class="yahc">
    sometext
</p>

I hope this clears up the ambiguity around my question.

How can I get the HTML from an HtmlXPathSelector in Scrapy? (Or is the solution outside Scrapy's scope?)

Answer 1

Call .extract() on your XPathSelectorList. It returns a list of unicode strings containing the HTML content you want.

hxs.select('//div[@id="leexample"]/*').extract()

Update

# This is wrong
hxs.select('//div[@id="leexample"]/html()').extract()

/html() is not a valid Scrapy selector. To extract all children, use '//div[@id="leexample"]/*' or '//div[@id="leexample"]/node()'. Note that node() also returns text nodes, so the result looks something like:

[u'\n   ',
 u'<a href="image1.html">Name: My image 1 '
]
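
To illustrate why node() returns more than /*, here is a minimal sketch that uses only Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors. That substitution is an assumption made so the snippet runs without Scrapy: ElementTree has no node() step, so the leading text node is read from .text instead. The markup is a shortened, hypothetical version of the example in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the markup from the question.
DOC = (
    '<body>'
    '<div id="leexample">justtext<p class="yahc">sometext</p></div>'
    '<div id="lenot">blabla</div>'
    '</body>'
)

root = ET.fromstring(DOC)
div = root.find(".//div[@id='leexample']")

# Rough equivalent of //div[@id="leexample"]/* : element children only.
element_children = [child.tag for child in div]

# The text before the first child element is a text node, not an element,
# so a /* step does not select it while a node() step does.
leading_text = div.text or ''

print(element_children)
print(leading_text)
```

In this sketch the leading "justtext" is not among the element children, which is the part of the inner HTML that /* alone would miss.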

Answer 2

Use:

//span[@class="title"]/node()

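This selects all the child nodes (elements, text nodes, processing instructions and comments) of every span element in the XML document whose class attribute has the value "title".

If you only want the child nodes of the first such span in the document, use: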

(//span[@class="title"])[1]/node()

Answer 3

This is very late, but I'm leaving it here for the record.

What worked for me is:

html = ''.join(hxs.select('//span[@class="title"]/node()').extract())

Or, if we want to match several nodes:

elements = hxs.select('//span[@class="title"]')
html = [''.join(e.select('./node()').extract()) for e in elements]

Answer 4

Echoing what @xiaowl pointed out, hxs.select('//div[@id="leexample"]').extract() retrieves all the HTML of the tag(s) matched by the XPath query //div[@id="leexample"].

So, for the record, I ended up with:

post = postItem()  # body = Field, declared in item.py
post['body'] = hxs.select('//span[@id="edit' + self.postid + '"]').extract()
# Use a context manager so the log file is closed after writing.
with open('logs/test.log', 'wb') as log_file:
    log_file.write(str(post['body']))
# logs/test.log now contains all the HTML of the tag selected by the query.

Answer 5

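It's actually not as hard as it seems. Just remove the trailing / from your XPath query and use the extract() method. I ran an example in the scrapy shell; this is a shortened version: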

sjaak:~ sjaakt$ scrapy shell
2012-07-19 11:06:21+0200 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
>>> fetch('http://www.nu.nl')
2012-07-19 11:06:34+0200 [default] INFO: Spider opened
2012-07-19 11:06:34+0200 [default] DEBUG: Crawled (200) <GET http://www.nu.nl> (referer: None)
>>> hxs.select("//h1").extract()
[u'<h1>    <script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    </h1>\n    ']
>>>

To get only the inner content of the tag, add /* to the XPath query. For example:

>>> hxs.select("//h1/*").extract()
[u'<script type="text/javascript">document.write(NU.today())</script>.\n    Het laatste nieuws het eerst op NU.nl    ']

Answer 6

A bit of a hack (it reaches into the Selector's private _root attribute; works in 1.0.5):

from lxml import html
def extract_inner_html(sel):
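    # Serialize direct children only; iterdescendants() would emit nested elements twice.
    return (sel._root.text or '') + ''.join(html.tostring(child, encoding='unicode') for child in sel._root.iterchildren())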

def extract_inner_text(sel):
    return (''.join(sel.css('::text').extract())).strip()

Use it like this:

reason = extract_inner_html(statement.css(".politic-rating .rate-reason")[0])
text = extract_inner_text(statement.css('.politic-statement')[0])
all_text = extract_inner_text(statement.css('.politic-statement'))

I found the lxml part of the code in this question.
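
The same idea (the element's leading text plus each direct child serialized together with its tail) can be sketched without touching a private attribute. The version below uses the standard library's xml.etree.ElementTree as a stand-in, which is an assumption on my part: it only parses well-formed markup, it is not Scrapy's API, and inner_html is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Return the markup inside `element`, excluding its own tag (hypothetical helper)."""
    parts = [element.text or '']
    for child in element:
        # tostring() serializes the child together with the text that follows it (its tail).
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


# Hypothetical, shortened markup based on the question's example.
div = ET.fromstring('<div id="leexample">justtext<p class="yahc">sometext</p>more</div>')
result = inner_html(div)
print(result)
```

Iterating over direct children only matters here: walking all descendants would serialize nested elements more than once.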
