I have HTML text like this:
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>
I tried querying it with the following XPath:
//p//text() | //othertag//text() | //moretag//text()
It gives me the text split into separate pieces at each <br> tag:
('This is some important data','Even this is data','this is useful too','othertag text data','moretag text data')
I want it as one complete string:
('This is some important data Even this is data this is useful too')
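because I will be using the | (union) XPath operator to query other elements as well, and it is very important that this text content is divided correctly.
How can I do this?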
If this is not possible, can I somehow get the inner HTML of <p>, so that I can store it as text like this:
This is some important data<br>Even this is data<br>this is useful too
I am using lxml.html with Python 2.7.
Answer 0 (score: 2)
Update
Based on your edit, you can use the XPath string() function. For example:
>>> doc.xpath('string(//p)')
'\n This is some important data\n \n Even this is data\n \n this is useful too\n '
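The result above still carries the newlines from the source markup. As a minimal sketch, and assuming it is acceptable to collapse all whitespace runs into single spaces, the standard XPath 1.0 function normalize-space() can be wrapped around string(). The sample markup is copied from the question; parsing from a string rather than a file is only for illustration.

```python
from lxml import etree

# Sample markup from the question, parsed with the HTML parser.
raw = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>"""
doc = etree.fromstring(raw, parser=etree.HTMLParser())

# string() concatenates all text inside the first <p>;
# normalize-space() trims both ends and collapses inner whitespace runs.
text = doc.xpath('normalize-space(string(//p))')
print(text)
```

Note that string() only looks at the first node of the node-set, so this gives one string for the first <p>, not one per matched element.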
(The original answer follows.)
If you are getting the text you want back in multiple pieces:
('This is some important data','Even this is data','this is useful too')
why not just join the pieces?
>>> ' '.join(doc.xpath('//p/text()'))
'\n This is some important data\n  \n Even this is data\n  \n this is useful too\n '
You can even get rid of the newlines:
>>> ' '.join(x.strip() for x in doc.xpath('//p/text()'))
'This is some important data Even this is data this is useful too'
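Since the question combines several elements with the | operator, one possible way to keep the text grouped per element is to select the elements themselves rather than their text nodes, and then build one string for each. This is a sketch, not the only way to do it. It assumes lxml.html (as mentioned in the question), whose elements provide text_content(), and it uses the question's sample markup.

```python
import lxml.html

# Sample markup from the question, including the two extra tags.
raw = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
data
</othertag>
<moretag>
data
</moretag>"""
doc = lxml.html.document_fromstring(raw)

# Select elements (not text nodes) so each match stays one unit.
elements = doc.xpath('//p | //othertag | //moretag')

# text_content() returns all text inside an element; split()/join()
# collapses newlines and repeated spaces into single spaces.
texts = [' '.join(el.text_content().split()) for el in elements]
print(texts)
```

Each entry in texts then corresponds to exactly one matched element, in document order.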
If you want the "inner HTML" of the p element, you can call lxml.etree.tostring on all of its children:
>>> ''.join(etree.tostring(x) for x in doc.xpath('//p')[0].getchildren())
'<br/>\n Even this is data\n <br/>\n this is useful too\n '
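Note that the output above starts at the first <br/>: the text before the first child lives in the element's .text attribute and is not part of any child. A minimal sketch that also includes it is shown here. The helper name inner_html is hypothetical, and the leading text is inserted unescaped, so text containing characters such as < or & would need escaping first.

```python
import lxml.html
from lxml import etree

raw = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>"""
p = lxml.html.fromstring(raw)  # a single top-level element is returned as-is

def inner_html(el):
    """Hypothetical helper: serialize an element's contents without its own tag."""
    # el.text is the text before the first child (inserted unescaped here).
    parts = [el.text or '']
    # tostring() includes each child's tail text by default.
    parts.extend(etree.tostring(child, method='html', encoding='unicode')
                 for child in el)
    return ''.join(parts)

result = inner_html(p)
print(repr(result))
```

Using method='html' serializes the line breaks as <br> rather than <br/>, which is closer to the form asked for in the question; the newlines from the source markup are kept.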
Note: all of these examples assume:
>>> from lxml import etree
>>> doc = etree.parse(open('myfile.html'),
... parser=etree.HTMLParser())
Answer 1 (score: 2)
You can also expose your own functions in XPath:
import lxml.html, lxml.etree
raw_doc = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
'''
doc = lxml.html.fromstring(raw_doc)
ns = lxml.etree.FunctionNamespace(None)
def cat(context, a):
    return [''.join(a)]
ns['cat'] = cat
print repr(doc.xpath('cat(//p/text())'))
This prints:
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'
You can use this approach to perform whatever transformation you need on the selected text.
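As one example of such a transformation, here is a sketch of a variant that strips each text node and joins the pieces with single spaces. The function name cat_clean is hypothetical, and unlike cat() above it returns a plain string rather than a list. Keep in mind that FunctionNamespace(None) registers the function for every XPath evaluation in the process.

```python
import lxml.html
from lxml import etree

raw = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>"""
doc = lxml.html.fromstring(raw)

# Register in the default (empty) function namespace, as in the example above.
ns = etree.FunctionNamespace(None)

def cat_clean(context, nodes):
    # Hypothetical variant of cat(): strip each text node, drop empty ones,
    # and return a single space-joined string.
    return ' '.join(s.strip() for s in nodes if s.strip())

ns['cat_clean'] = cat_clean

result = doc.xpath('cat_clean(//p/text())')
print(result)
```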