Using lxml and XPath to get the following node across different ancestors

Date: 2014-01-17 22:37:16

Tags: python xpath lxml

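I am writing a text-to-speech program that reads math equations aloud. One of its threads needs to pull out math equations (rendered as MathJax SVG) and parse them into prose.

Because of how the content is laid out, a math equation can be nested arbitrarily deep inside other elements such as paragraphs, bold text, tables, and so on.

Given a reference to the current element, how do I get the next <span class="MathJax_SVG">, which may be embedded under a different parent or ancestor?

I tried to solve it with the following: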

nextMath = currentElement.xpath('following::.//span[@class=\'MathJax_SVG\']')

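This returns nothing, even though I can visually confirm that there is a matching element after it. I tried removing the period, but then lxml complained that my XPath was malformed.

Has anyone run into this before?

P.S. Here is a test document to illustrate my point: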

<html>
   <head>
      <title>Test Document</title>
   </head>
   <body>
      <h1 id="mainHeading">The Quadratic Formula</h1>
      <p>The quadratic formula is used to solve quadratic equations. Here is the formula:</p>
      <p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
      <p>Here are some possible values when you use the formula:</p>
      <p>
      <table>
         <tr>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
         </tr>
         <tr>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_4">removed the SVG</span></td>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_5">removed the SVG</span></td>
         </tr>
      </table>
      </p>
   </body>
</html>

Update

I have read that lxml does not support absolute positions. That may be relevant.

Some test code (assuming you saved the HTML above as test.html):

from lxml import html

# Parse the test document
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the starting element (the main heading)
start = myHtml.find('.//h1[@id=\'mainHeading\']')

print('My start:', html.tostring(start, encoding='unicode'))

# Try to get the next math equation (this is the XPath that fails)
nextXPath = 'following::.//span[@class=\'MathJax_SVG\']'
nextElem = start.xpath(nextXPath)

if len(nextElem) > 0:
    print('Next equation:', html.tostring(nextElem[0], encoding='unicode'))
else:
    print('No next equation...')

2 answers:

Answer 0 (score: 0)

Do you need to traverse the document at all? You can also search directly for span elements with the MathJax_SVG class:

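from lxml import html

# Use the HTML parser (tolerates markup that is not well-formed XML);
# test.html is the file name given in the question
doc = html.parse("test.html").getroot()
# Every MathJax span, returned in document order
maths = doc.xpath("//span[@class='MathJax_SVG']")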
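If the current position is itself one of those spans, the next equation is simply the next entry in that list. Below is a minimal sketch of that idea. The helper name `math_after` and the inline `DOC` string are made up for illustration (the markup is a trimmed copy of the question's test document), and it assumes the current element is always a MathJax span.

```python
from lxml import html

# Trimmed-down copy of the question's test document (illustration only).
DOC = """
<html>
  <body>
    <h1 id="mainHeading">The Quadratic Formula</h1>
    <p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
    <table>
      <tr>
        <td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
        <td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
      </tr>
    </table>
  </body>
</html>
"""

root = html.fromstring(DOC)

# XPath node-sets come back in document order, so list position
# corresponds to reading order.
maths = root.xpath("//span[@class='MathJax_SVG']")


def math_after(maths, current):
    """Return the span that follows `current` in `maths`, or None.

    Hypothetical helper: `current` must be one of the spans in the list.
    """
    try:
        i = maths.index(current)
    except ValueError:
        return None  # current is not a MathJax span
    return maths[i + 1] if i + 1 < len(maths) else None


second = math_after(maths, maths[0])
print(second.get("id"))
```

This does not help when the current element is something other than a MathJax span (a heading, say), since it will not appear in the list.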

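Answer 1 (score: 0)

I ended up writing my own function to get what I wanted. I call it getNext(elem, xpathString). If there is a more efficient way to do this, I am all ears. I am not confident about its performance.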

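from lxml import html

def getNext(elem, xpathString):
    '''
    Gets the next element matching xpathString. The search starts
    below elem itself, then moves on to each following sibling, and
    climbs to the parent when the siblings run out.

    Note: find() only looks below the element it is called on, so a
    following sibling that matches by itself (rather than containing
    a match) is not detected.
    '''
    myElem = elem
    nextElem = elem.find(xpathString)

    while nextElem is None:

        if myElem.getnext() is not None:
            # Move to the next sibling and search inside it
            myElem = myElem.getnext()
            nextElem = myElem.find(xpathString)

        else:
            if myElem.getparent() is not None:
                # No more siblings: climb one level and keep going
                myElem = myElem.getparent()
            else:
                # Reached the root without finding a match
                break

    return nextElem


# Parse the test document
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//span[@id=\'MathJax_Element_Frame_1\']')

print('My start:', html.tostring(start, encoding='unicode'))

# Get next math equation
nextXPath = './/span[@class=\'MathJax_SVG\']'
nextElem = getNext(start, nextXPath)

if nextElem is not None:
    print('Next equation:', html.tostring(nextElem, encoding='unicode'))
else:
    print('No next equation...')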
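
For comparison, here is a minimal sketch of the same lookup using the XPath `following` axis through lxml's `xpath()` method. The axis takes a node test directly (`following::span`), with no `.` or `//` after the `::`. The helper name `next_math` and the inline `DOC` string are made up for illustration; the markup is a trimmed copy of the test document from the question.

```python
from lxml import html

# Trimmed-down copy of the question's test document (illustration only).
DOC = """
<html>
  <body>
    <h1 id="mainHeading">The Quadratic Formula</h1>
    <p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
    <table>
      <tr>
        <td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
        <td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
      </tr>
    </table>
  </body>
</html>
"""


def next_math(elem):
    """Return the first MathJax span after elem in document order, or None.

    Hypothetical helper, not part of lxml.
    """
    # following:: selects everything after elem in document order,
    # regardless of parent; [1] keeps only the nearest match.
    found = elem.xpath("following::span[@class='MathJax_SVG'][1]")
    return found[0] if found else None


root = html.fromstring(DOC)
start = root.xpath("//h1[@id='mainHeading']")[0]

# Walk from the heading through every equation in reading order.
ids = []
node = next_math(start)
while node is not None:
    ids.append(node.get("id"))
    node = next_math(node)

print(ids)
```

One difference from getNext: the `following` axis excludes descendants of the starting element, whereas getNext searches below the starting element first.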