Question

I am parsing a page with the following structure:

<pre class="asdf">content a</pre>
<pre class="asdf">content b</pre>

# returns
content a
content b

I am using the following XPath to get the content: "//pre[@class='asdf']/text()"

It works well, except when elements are nested inside the <pre> tag; in that case it does not concatenate them:

<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>

# returns
content
content b

If I use this XPath instead, I get the output that follows: "//pre[@class='asdf']//text()"

content
a
content b

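I don't want either of these. I want to get all of the text inside a <pre>, even if it has children. I don't care whether the tags are stripped or not, but I want the text concatenated together.

How do I do this? I'm using lxml.html.xpath in python2, but I don't think it matters. This answer to another question makes me think that child:: might be related to my answer.

Here is some code that reproduces it.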

from lxml import html

tree = html.fromstring("""
<pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
<pre class="asdf">content b</pre>
""")
for row in tree.xpath("//*[@class='asdf']/text()"):
  print("row: ", row)

Answer 1

You should use .text_content():

.text_content(): Returns the text content of the element, including the text content of its children, with no markup.

for row in tree.xpath("//*[@class='asdf']"):
    print("row: ", row.text_content())

演示：

>>> from lxml import html
>>> 
>>> tree = html.fromstring("""
... <pre class="asdf">content <a href="http://stackoverflow.com">a</a></pre>
... <pre class="asdf">content b</pre>
... """)
>>> for row in tree.xpath("//*[@class='asdf']"):
...     print("row: ", row.text_content())
... 
('row: ', 'content a')
('row: ', 'content b')
