Question

A version of this question has been asked before.

Here is the HTML data:

<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>


<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>
....
repeating n times
....

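My goal is to extract the data inside <p> without it being split apart from the rest of its text by the <br> tags.

This is my query:

//p//text() | //othertag//text() | //moretag//text()

This gives: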

('This is is some important data', 'even this data', 'this is useful too',
'othertag data','moretag data')

Notice how the text data of the <p> tag is split up in the output?

I would like it grouped into proper units, like this:

('This is is some important data even this data this is useful too',
'othertag data','moretag data')

If that is not possible, can I get it in this form instead?

('This is is some important <br> data even this data <br> this is useful too',
'othertag data','moretag data')

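I cannot use a join statement, because it is hard to selectively join a variable number of list values at variable indices (nobody can predict how many <br> tags there will be, so the data may be split a variable number of times).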

My attempts (with help from other users):

string(//p//text()) | //othertag//text() | //moretag//text()

The query above gives an XPath error.

And this one:

import lxml.html, lxml.etree

ns = lxml.etree.FunctionNamespace(None)

def cat(context, a):
    return ''.join(a)
ns['cat'] = cat

The following query, which uses it, also gives an InvalidType error:

cat(//p//text()) | //othertag//text() | //moretag//text()

I am using Python 2.7.

Answer 1

If you are willing to use another library, you can use BeautifulSoup.

Demo:

>>> s = """<p>
... This is some important data
... <br>
... Even this is data
... <br>
... this is useful too
... </p>
...
...
... <othertag>
...  othertag data
... </othertag>
... <moretag>
...  moretag data
... </moretag>"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s,'html.parser')

>>> soup.find('p').text
'\nThis is some important data\n\nEven this is data\n\nthis is useful too\n'

>>> print(soup.find('p').text)

This is some important data

Even this is data

this is useful too
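
To get closer to the tuple asked for in the question, the same idea can be applied to all three tags, with the whitespace collapsed afterwards. This is only a minimal sketch: the variable names `source` and `texts` are mine, and it assumes the tags of interest are exactly `p`, `othertag` and `moretag`.

```python
from bs4 import BeautifulSoup

source = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>"""

soup = BeautifulSoup(source, 'html.parser')

# One string per matching tag; split()/join() collapses the newlines
# left behind by the <br> tags into single spaces.
texts = tuple(
    ' '.join(tag.get_text().split())
    for tag in soup.find_all(['p', 'othertag', 'moretag'])
)
print(texts)
```

Each `<p>` then comes back as a single string regardless of how many `<br>` tags it contains.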

Answer 2

You can try the following custom XPath function:

Demo code:

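import lxml.html, lxml.etree

source = '''your html here'''
doc = lxml.html.fromstring(source)

# Register the function in the default (un-prefixed) XPath function namespace.
ns = lxml.etree.FunctionNamespace(None)

def cat(context, elements):
    # Return one string per element, built from all of its descendant text nodes.
    return [''.join(e.xpath('.//text()')) for e in elements]
ns['concat-texts'] = cat

# print(...) with a single argument behaves the same on Python 2.7 and Python 3.
print(repr(doc.xpath('concat-texts(//p) | //othertag//text() | //moretag//text()')))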

Sample HTML input:

source = '''
<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>

<p>
foo
<br>
bar
<br>
baz
</p>

<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>
'''

Output:

['\nThis is some important data\n\nEven this is data\n\nthis is useful too\n', '\nfoo\n\nbar\n\nbaz\n', '\n othertag data\n', '\n moretag data\n']
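
The output above still carries the newlines that surrounded the <br> tags. If that is unwanted, one possible variant is to collapse the whitespace inside the function and pass it every tag of interest. This is a sketch only: the function name `norm-texts` is made up for illustration, and it relies on lxml accepting a list of strings as the return value of an extension function, as the demo above does.

```python
import lxml.etree
import lxml.html

source = '''<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>'''

def norm_texts(context, elements):
    # Join all descendant text nodes of each element, then collapse
    # every run of whitespace (including newlines) into one space.
    return [' '.join(''.join(e.xpath('.//text()')).split()) for e in elements]

# 'norm-texts' is a hypothetical name, registered in the default namespace.
ns = lxml.etree.FunctionNamespace(None)
ns['norm-texts'] = norm_texts

doc = lxml.html.fromstring(source)
result = [str(t) for t in doc.xpath('norm-texts(//p | //othertag | //moretag)')]
print(result)
```

Because the union is taken over elements before the function is called, every result goes through the same normalization.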

Answer 3

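I know this is late, but someone may still find it useful. The way I got it to work was by replacing the br tags in the original HTML. It is a bytes object, so it has to be decoded and re-encoded, but it works like a charm: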

SmallChange

After that, //p//text() returns "This is some important data Even this is data this is useful too".
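
A minimal sketch of the replacement described above, not necessarily the exact code from this answer. It assumes the page arrives as UTF-8 bytes, and it uses a regular expression so that `<br>`, `<br/>` and `<BR />` are all covered; the names `raw`, `cleaned` and `text` are illustrative.

```python
import re

import lxml.html

raw = b"""<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>"""

# Decode the bytes, swap every <br> variant for a space, and encode again.
decoded = raw.decode('utf-8')
cleaned = re.sub(r'<br\s*/?>', ' ', decoded, flags=re.IGNORECASE).encode('utf-8')

doc = lxml.html.fromstring(cleaned)

# With the <br> elements gone, the <p> holds a single text node.
parts = doc.xpath('//p//text()')

# Collapse the remaining newlines into single spaces.
text = ' '.join(parts[0].split())
print(text)
```

Note that the raw text node still contains the original newlines, so the final whitespace cleanup is done in Python.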

Answer 4

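You say: "I want it formatted as proper units, like this:

('This is is some important data even this data this is useful too', 'othertag data','moretag data')"

But XPath does not do formatting. What you are really asking for is a sequence of three strings to be returned; the formatting is done later.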

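You are using Python, which I think means you are using XPath 1.0. In XPath 1.0, there is no such thing as a sequence of three strings. You can return three nodes (the p, othertag, and moretag nodes), and then extracting the string values of those nodes becomes a Python problem rather than an XPath problem. Or you can return the three strings in three separate calls: for example, string(//p) will give you the string value of the first p element.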
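
As a rough illustration of both options with lxml (the variable names are illustrative, and the sample is the HTML from the question): XPath selects the three elements, and Python takes the string value of each.

```python
import lxml.html

source = """<p>
This is some important data
<br>
Even this is data
<br>
this is useful too
</p>
<othertag>
 othertag data
</othertag>
<moretag>
 moretag data
</moretag>"""

doc = lxml.html.fromstring(source)

# Option 1: XPath returns nodes; string(.) is evaluated once per node,
# and the whitespace cleanup happens in Python.
nodes = doc.xpath('//p | //othertag | //moretag')
texts = tuple(' '.join(node.xpath('string(.)').split()) for node in nodes)
print(texts)

# Option 2: a separate call per string; string(//p) only sees the first <p>.
first_p = ' '.join(doc.xpath('string(//p)').split())
print(first_p)
```

The number of <br> tags inside each element does not matter here, because string(.) concatenates all descendant text nodes of the element it is applied to.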

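In your question, you say the data is repeated, but you do not say which data is repeated. I do not have a clear picture of what your real source document looks like. That may be why the answers to your question, including mine, are so incomplete.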

XPATH - How to get inner text data interleaved with <br/> tags?
