How to exclude text anchored by a specific tag from lxml text_content()

Asked: 2019-04-06 13:39:31

Tags: python xpath lxml

I know there are similar questions, but since they did not solve my problem, please bear with me for raising it again.

This is my string:

normal = """
  <p>
    <b>
      <a href='link1'>        Forget me  </a>
    </b>     I need this one      <br>
    <b>
     <a href='link2'>  Forget me too  </a>
    </b> Forget me not <i>even when</i> you go to sleep <br>
    <b>  <a href='link3'>  Forget me three  </a>
    </b>  Foremost on your mind <br>
   </p>    
"""

I start with:

import lxml.html
from lxml import etree

target = lxml.html.fromstring(normal)
tree_struct = etree.ElementTree(target)

Now, I basically need to ignore everything anchored by the <a> tags. However, if I run the following code:

for e in target.iter():
   item = target.xpath(tree_struct.getpath(e))
   if len(item)>0:
       print(item[0].text)  

I get nothing; on the other hand, if I change the print statement to:

  print(item[0].text_content()) 

I get the following output:

Forget me
 I need this one

 Forget me too

Forget me not
even when
you go to sleep


 Forget me three

Foremost on your mind 

The output I want is:

 I need this one

Forget me not
even when
you go to sleep    

Foremost on your mind 

Besides producing the wrong output, this approach is also inelegant. So I must be missing something obvious, although I cannot figure out what.

1 Answer:

Answer 0: (score: 1)

I think you are making this unnecessarily complicated. There is no need to create the tree_struct object and use getpath(). Here is a suggestion:

  
from lxml import html

normal = """
  <p>
    <b>
      <a href='link1'>        Forget me  </a>
    </b>     I need this one      <br>
    <b>
     <a href='link2'>  Forget me too  </a>
    </b> Forget me not <i>even when</i> you go to sleep <br>
    <b>  <a href='link3'>  Forget me three  </a>
    </b>  Foremost on your mind <br>
   </p>
"""

target = html.fromstring(normal)

for e in target.iter():
    if e.tag != "a":
        # Print text content if not only whitespace 
        if e.text and e.text.strip():
            print(e.text.strip())
        # Print tail content if not only whitespace
        if e.tail and e.tail.strip():
            print(e.tail.strip())

Output:

 
I need this one
Forget me not
even when
you go to sleep
Foremost on your mind
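
The loop above works because lxml stores the text that follows an element's closing tag in that element's .tail attribute, while .text only holds the text before the first child. As a possible alternative, the same filtering can be pushed into XPath. The sketch below is not part of the original answer: it selects every text node that has no <a> ancestor, and the helper name text_outside_links is made up for illustration.

```python
from lxml import html

normal = """
  <p>
    <b>
      <a href='link1'>        Forget me  </a>
    </b>     I need this one      <br>
    <b>
     <a href='link2'>  Forget me too  </a>
    </b> Forget me not <i>even when</i> you go to sleep <br>
    <b>  <a href='link3'>  Forget me three  </a>
    </b>  Foremost on your mind <br>
   </p>
"""


def text_outside_links(root):
    """Return stripped, non-empty text nodes that are not inside an <a>.

    Hypothetical helper for illustration; it is not an lxml API.
    """
    # text() matches the strings lxml exposes as .text and .tail.
    # not(ancestor::a) drops any text node nested inside a link,
    # however deep it sits.
    nodes = root.xpath(".//text()[not(ancestor::a)]")
    return [t.strip() for t in nodes if t.strip()]


target = html.fromstring(normal)
for line in text_outside_links(target):
    print(line)
```

For the sample string this prints the same five lines as the loop above. The two approaches differ in edge cases: the tag check also skips text that directly follows a closing </a> (its tail) and does not skip text in elements nested inside a link, whereas the XPath predicate keeps the former and drops the latter.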