Question

Suppose I have many HTML snippets like this:

<div style="clear:both" id="novelintro" itemprop="description">you are foolish!<font color=red size=4>I am superman!</font></div>

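I want to use XPath to extract the text: you are foolish! I am superman!

However, if I use

xpath('//div[@id="novelintro"]/text()').extract()

I only get "you are foolish!"

And when I use:

xpath('//div[@id="novelintro"]/font/text()').extract()

I only get "I am superman!"

So, if only a single XPath expression may be used, how can I extract the whole sentence, "you are foolish! I am superman!"?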

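Worse, the inner tag in the HTML above is <font>, but my other snippets contain many other tags. For example:

Extract "hi girl I love you!" from the following snippet:

<div style="clear:both" id="novelintro" itemprop="description">hi girl<legend >I love you!</legend></div>

Extract "If I marry your mother then I am your father!" from the following snippet:

<div style="clear:both" id="novelintro" itemprop="description">If I<legend > marry your mother<div>then I am your father!</div></legend></div>

Is there a single XPath expression that works for all of these HTML snippets?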

Answer 1

If your document is:

<outer>This is outer text.<inner>And this is inner text.</inner>More outer text.</outer>

and you use the XPath expression /outer//text() (read: any text node below 'outer'), the result is a list like:

This is outer text.
-----------------------
And this is inner text.
-----------------------
More outer text.
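Applied to the first snippet from the question, the same idea might look like the sketch below. It calls lxml directly (an assumption: lxml is installed; Scrapy's selectors are built on it) so that it runs without a Scrapy response. The variable names and the choice of a single space as separator are illustrative, not part of the original answer.

```python
from lxml import html

# First snippet from the question, unchanged.
snippet = ('<div style="clear:both" id="novelintro" itemprop="description">'
           'you are foolish!<font color=red size=4>I am superman!</font></div>')

tree = html.document_fromstring(snippet)

# //text() selects every text node below the div, whatever the inner tag is.
parts = [str(t) for t in tree.xpath('//div[@id="novelintro"]//text()')]

# Join the pieces; the separator (a single space here) is a free choice.
sentence = " ".join(parts)

print(parts)
print(sentence)
```

In a Scrapy callback the equivalent would presumably be to join the list returned by xpath('//div[@id="novelintro"]//text()').extract().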

Answer 2

You can use XPath's string() function, which recursively converts a single node to a string (the optional . refers to the current node):

# Python 3; Selector replaces the deprecated HtmlXPathSelector
from scrapy.selector import Selector

def node_to_string(node):
    # string(.) concatenates all descendant text nodes of the current node
    return node.xpath("string(.)").extract()[0]

# ------------------------------------------------------

body = """<body>
  <div style="clear:both" id="novelintro" itemprop="description">you are foolish!<font color=red size=4>I am superman!</font></div>
  <div style="clear:both" id="novelintro2" itemprop="description">hi girl<legend >I love you!</legend></div>
  <div style="clear:both" id="novelintro3" itemprop="description">If I<legend > marry your mother<div>then I am your father!</div></legend></div>
</body>"""

sel = Selector(text=body)

# single target use
print(node_to_string(sel.xpath('//div[@id="novelintro"]')))
print()

# multi target use
for div in sel.xpath('//body/div'):
    print(node_to_string(div))
print()

# alternatively
print([node_to_string(n) for n in sel.xpath('//body/div')])
print()

Output

you are foolish!I am superman!

you are foolish!I am superman!
hi girlI love you!
If I marry your motherthen I am your father!

['you are foolish!I am superman!', 'hi girlI love you!', 'If I marry your motherthen I am your father!']

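Note that the words run together because the source contains no whitespace between them: string() concatenates the descendant text nodes exactly as they appear and does not insert any separator.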
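If the words should stay separated, one option is to collect the text nodes yourself and join them with an explicit separator instead of relying on string(). The sketch below is only an illustration: node_text is a hypothetical helper, the sample body is trimmed to two of the divs, and lxml is used directly (an assumption) so that the example runs outside a spider.

```python
from lxml import html

body = """<body>
  <div id="novelintro">you are foolish!<font color=red size=4>I am superman!</font></div>
  <div id="novelintro2">hi girl<legend >I love you!</legend></div>
</body>"""

def node_text(node, sep=" "):
    """Join the stripped, non-empty text nodes below `node` with `sep`."""
    pieces = (t.strip() for t in node.xpath(".//text()"))
    return sep.join(p for p in pieces if p)

# document_fromstring keeps the <body> element as written.
tree = html.document_fromstring(body)
texts = [node_text(div) for div in tree.xpath("//body/div")]

for line in texts:
    print(line)
```

Stripping each piece before joining also removes stray indentation and newlines, at the cost of losing any whitespace that was meaningful inside a text node.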

How to use XPath to extract text across multiple tags in an HTML snippet
