Parse all text within a tag using lxml in Python

Time: 2016-08-14 05:23:58

Tags: python html parsing lxml

I am trying to parse an HTML file like the following:

<ol>
  <li>
    <div class="c1">
      <span class="s1">hi</span>
      " hello "
      <span class="s2">world!</span>
    </div>
  </li>
  <li>
    <div class="c2">
      <span class="s3">abc</span>
      " def ghijkl "
      <span class="s1">mno</span>
      " pqr!"
    </div>
  </li>
</ol>

I tried parsing it with the following code:

from lxml import html

# `code` is not defined in the question; presumably a response object whose .content holds the page
tree = html.fromstring(code.content)
sol = tree.xpath('//ol//text()')
for x in sol:
    print(x)

The result has every text node printed on its own line.

How can I get all the text inside each <li> tag on a single line? That is, I want the output:

hi hello world!
abc def ghijkl mno pqr!

2 answers:

Answer 0 (score: 1)

$ cat a.py
from lxml import etree

xml = """<ol>
  <li>
    <div class="c1">
      <span class="s1">hi</span>
      " hello "
      <span class="s2">world!</span>
    </div>
  </li>
  <li>
    <div class="c2">
      <span class="s3">abc</span>
      " def ghijkl "
      <span class="s1">mno</span>
      " pqr!"
    </div>
  </li>
</ol>"""

tree = etree.fromstring(xml)
sol = tree.xpath('//ol//li')
for a in sol:
    # Join every text node under this <li>, trimming the indentation around each one
    print(" ".join([t.strip() for t in a.itertext()]).strip())

$ python a.py
hi " hello " world!
abc " def ghijkl " mno " pqr!"
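
One caveat with the itertext() approach above: text nodes that contain only whitespace become empty strings after strip(), and joining them can leave runs of spaces inside a line. A minimal sketch that drops those empty fragments first might look like this. The helper name text_per_item and its item_xpath parameter are made up for illustration, and lxml.html is used here instead of etree as an assumption that the input is HTML.

```python
from lxml import html

MARKUP = """<ol>
  <li><div class="c1"><span class="s1">hi</span> " hello " <span class="s2">world!</span></div></li>
  <li><div class="c2"><span class="s3">abc</span> " def ghijkl " <span class="s1">mno</span> " pqr!"</div></li>
</ol>"""


def text_per_item(markup, item_xpath="//ol/li"):
    """Return one string per matched element (hypothetical helper)."""
    tree = html.fromstring(markup)
    lines = []
    for item in tree.xpath(item_xpath):
        # itertext() yields every text node below the element, in document order.
        fragments = (fragment.strip() for fragment in item.itertext())
        # Skip fragments that were pure whitespace so they add no extra spaces.
        lines.append(" ".join(fragment for fragment in fragments if fragment))
    return lines


for line in text_per_item(MARKUP):
    print(line)
```

Whitespace inside a single text node (for example between the quote marks) is left as it is; only the gaps between nodes are normalized.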

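Answer 1 (score: 1)

You can select each li and apply XPath's normalize-space() to it. The function strips leading and trailing whitespace and collapses every internal run of whitespace into a single space: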

from lxml import html
h = """<ol>
  <li>
    <div class="c1">
      <span class="s1">hi</span>
      " hello "
      <span class="s2">world!</span>
    </div>
  </li>
  <li>
    <div class="c2">
      <span class="s3">abc</span>
      " def ghijkl "
      <span class="s1">mno</span>
      " pqr!"
    </div>
  </li>
</ol>"""


tree = html.fromstring(h)

for li in tree.xpath("//ol/li"):
    print(li.xpath("normalize-space(.)"))

This gives you:

hi " hello " world!
abc " def ghijkl " mno " pqr!"
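
Both outputs above still contain the literal quote characters from the markup, while the question asks for plain words. If those quotes really are part of the page text and are not wanted, one possible sketch is to replace them with spaces before collapsing whitespace. The function name clean_item_text is hypothetical, and the approach assumes no double quote in the page should be kept.

```python
from lxml import html

MARKUP = """<ol>
  <li><div class="c1"><span class="s1">hi</span> " hello " <span class="s2">world!</span></div></li>
  <li><div class="c2"><span class="s3">abc</span> " def ghijkl " <span class="s1">mno</span> " pqr!"</div></li>
</ol>"""


def clean_item_text(markup):
    """Return the text of each <li> with double quotes removed (hypothetical helper)."""
    tree = html.fromstring(markup)
    lines = []
    for li in tree.xpath("//ol/li"):
        # text_content() concatenates all text below the element.
        text = li.text_content().replace('"', " ")
        # split() with no arguments drops every run of whitespace, so the
        # join leaves exactly one space between words.
        lines.append(" ".join(text.split()))
    return lines


for line in clean_item_text(MARKUP):
    print(line)
# hi hello world!
# abc def ghijkl mno pqr!
```

Replacing the quotes with spaces instead of deleting them keeps neighbouring words from being glued together when a quote is the only separator.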