I know there are similar questions, but since none of them solved my problem, please bear with me as I raise it again.
Here is my string:
normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""
I start with:
import lxml.html
from lxml import etree

target = lxml.html.fromstring(normal)
tree_struct = etree.ElementTree(target)
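Now, I basically need to ignore everything anchored by the <a> tags. However, if I run the following code:
for e in target.iter():
    item = target.xpath(tree_struct.getpath(e))
    if len(item) > 0:
        print(item[0].text)
I get nothing; on the other hand, if I change the print statement to: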
print(item[0].text_content())
I get the following output:
Forget me
I need this one
Forget me too
Forget me not
even when
you go to sleep
Forget me three
Foremost on your mind
My desired output is:
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
Besides producing the wrong output, this approach is also inelegant. So I must be missing something obvious, though I can't figure out what.
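Answer 0: (score: 1)
I think you are making this unnecessarily complicated. There is no need to create the tree_struct object or to use getpath(). Here is a suggestion: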
from lxml import html
normal = """
<p>
<b>
<a href='link1'> Forget me </a>
</b> I need this one <br>
<b>
<a href='link2'> Forget me too </a>
</b> Forget me not <i>even when</i> you go to sleep <br>
<b> <a href='link3'> Forget me three </a>
</b> Foremost on your mind <br>
</p>
"""
target = html.fromstring(normal)
for e in target.iter():
    if e.tag != "a":
        # Print text content if not only whitespace
        if e.text and e.text.strip():
            print(e.text.strip())
        # Print tail content if not only whitespace
        if e.tail and e.tail.strip():
            print(e.tail.strip())
Output:
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
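The loop above works because of how the ElementTree model (which lxml follows) stores text: an element's text attribute holds only the text before its first child, while text that comes after the element's closing tag is stored on that element as tail. The following is a minimal sketch of that split, not part of the original answer. It uses the standard library's xml.etree.ElementTree and a shortened, well-formed variant of the first line of the sample markup (the snippet string and the parts dictionary are made up for illustration).

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the first line of the sample markup
# (<br/> is self-closed so the strict XML parser accepts it).
snippet = "<p><b><a href='link1'> Forget me </a></b> I need this one <br/></p>"
root = ET.fromstring(snippet)

# Map each tag to its (text, tail) pair; tags are unique in this snippet.
parts = {el.tag: (el.text, el.tail) for el in root.iter()}

for tag, (text, tail) in parts.items():
    print(f"{tag}: text={text!r} tail={tail!r}")
```

Here "I need this one" is the tail of <b>, not the text of <p>, which is why printing only .text misses it, and why skipping <a> elements while printing both .text and .tail of everything else gives the desired lines.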