XPath - How do I get inner text data that contains tags?

Asked: 2015-07-27 14:02:49

Tags: python xml xpath

I have HTML text like this:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 data
</othertag>
<moretag>
 data
</moretag>

I tried querying it with the following XPath:

//p//text() | //othertag//text() | //moretag//text()

It gives me text that is broken at every <br> tag, like this:

('This is some important data','Even this is data','this is useful too','othetag text data','moretag text data')

I want it as one complete string,

('This is some important data Even this is data this is useful too')

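because I will be querying other elements with the | (union) XPath operator, and it is very important that this text content is divided correctly, one string per element.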

How can I do this?

If this is not possible, can I somehow get the inner HTML of <p>, so that I can store it as text like this:

This is some important data<br>Even this is data<br>this is useful too

I am using lxml.html with Python 2.7.

2 Answers:

Answer 0: (score: 2)

Update

Based on your edit, you can use the XPath string() function. For example:

>>> doc.xpath('string(//p)')
'\n    This is some important data\n    \n    Even this is data\n    \n    this is useful too\n  '
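
The string() result above still carries the newlines and indentation from the source. As a minimal sketch, assuming Python 3 and the asker's markup held in a variable (here called raw_doc, a name chosen for illustration), the standard XPath 1.0 function normalize-space() can collapse that whitespace inside the query itself:

```python
import lxml.html

# raw_doc is an illustrative name; the markup is copied from the question.
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)

# normalize-space() takes the string value of the first matching <p>,
# trims both ends, and collapses each run of whitespace to one space.
text = doc.xpath('normalize-space(//p)')
print(repr(text))
```

Note that normalize-space(), like string(), only looks at the first node of the node-set it is given, so this form returns one string for the first <p> rather than one string per match.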

(Original answer follows)

If the text you want comes back in multiple pieces:

('This is some important data','Even this is data','this is useful too')

Why not just join those pieces?

>>> ' '.join(doc.xpath('//p/text()'))
'\n    This is some important data\n     \n    Even this is data\n     \n    this is useful too\n  '

You can even get rid of the newlines:

>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
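
The question also asks that the text stay divided per element when several tags are queried with the | union operator. One way to approach that, sketched here for Python 3, is to select the elements rather than their text nodes and then join the text of each element separately. The helper name joined_text and the variable raw_doc are assumptions made for this illustration, not part of lxml:

```python
import lxml.html

# Markup copied from the question; raw_doc is an illustrative name.
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 data
</othertag>
<moretag>
 data
</moretag>
'''

doc = lxml.html.fromstring(raw_doc)


def joined_text(element):
    """Hypothetical helper: join all text inside one element with single spaces."""
    pieces = (t.strip() for t in element.itertext())
    # Drop pieces that were whitespace only.
    return ' '.join(p for p in pieces if p)


# Select the elements themselves, so each one yields exactly one string.
elements = doc.xpath('//p | //othertag | //moretag')
texts = [joined_text(el) for el in elements]
print(texts)
```

Because the union returns elements in document order, the resulting list has one entry per matched element, regardless of how many <br> tags each one contains.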

If you want the "inner HTML" of the p element, you can call lxml.etree.tostring on all of its children:

>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n    Even this is data\n    <br/>\n    this is useful too\n  '
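
The output above starts at the first <br/> because the text before the first child lives in the element's .text attribute, not in any child. A minimal sketch of an inner-HTML helper that includes it, assuming Python 3, could look like the following. The function name inner_html is hypothetical, and stripping each text piece is a choice made to match the form requested in the question; it would glue words together if inline elements such as <b> sat next to text:

```python
import lxml.html
from lxml import etree

# Markup copied from the question; raw_doc is an illustrative name.
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''


def inner_html(element):
    """Hypothetical helper: serialize an element's content without its own tag."""
    # Text that appears before the first child element.
    parts = [(element.text or '').strip()]
    for child in element:
        # Serialize the child alone, then add the text that follows it.
        parts.append(etree.tostring(child, encoding='unicode',
                                    method='html', with_tail=False))
        parts.append((child.tail or '').strip())
    return ''.join(parts)


doc = lxml.html.fromstring(raw_doc)
p = doc.xpath('//p')[0]
html = inner_html(p)
print(html)
```

Using method='html' keeps the tag as <br> instead of the XML form <br/>.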

Note: all of these examples assume:

>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
...    parser=etree.HTMLParser())

Answer 1: (score: 2)

You can also expose your own functions to XPath:

import lxml.html, lxml.etree

raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat

print(repr(doc.xpath('cat(//p/text())')))

This prints:

'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

You can use this approach to perform transformations.
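
As one illustration of such a transformation, the sketch below (assuming Python 3) registers a function that joins the text nodes and collapses whitespace in a single step. The function name squash is made up for this example. As in the answer's code, FunctionNamespace(None) registers it globally, without a prefix:

```python
import lxml.html
from lxml import etree

# Markup copied from the question; raw_doc is an illustrative name.
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''

ns = etree.FunctionNamespace(None)


def squash(context, nodes):
    """Hypothetical XPath extension: join text nodes with single spaces."""
    # nodes arrives as a list of strings when given a text() node-set;
    # split() with no arguments discards all whitespace runs.
    return ' '.join(' '.join(nodes).split())


ns['squash'] = squash

doc = lxml.html.fromstring(raw_doc)
result = doc.xpath('squash(//p/text())')
print(repr(result))
```

Returning a plain string rather than a list makes the XPath call evaluate to a single string value.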