Question

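I'm writing a text-to-speech program that reads mathematical equations aloud. I have a thread that needs to pull out the math equations (rendered as MathJax SVG) and parse them into prose.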

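Because of how the content is laid out, the math equations can be nested arbitrarily deep inside other elements such as paragraphs, bold text, tables, and so on.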

Given a reference to the current element, how do I get the next <span class="MathJax_SVG">, which may be embedded under a different parent or ancestor?

I tried to solve it with the following:

nextMath = currentElement.xpath('following::.//span[@class=\'MathJax_SVG\']')

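This returns nothing, even though I can visually confirm that there is something after the current element. I tried removing the period, but then lxml complained that my XPath was malformed.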

Has anyone run into this before?

P.S. Here is a test document to illustrate my point:

<html>
   <head>
      <title>Test Document</title>
   </head>
   <body>
      <h1 id="mainHeading">The Quadratic Formula</h1>
      <p>The quadratic formula is used to solve quadratic equations. Here is the formula:</p>
      <p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
      <p>Here are some possible values when you use the formula:</p>
      <p>
      <table>
         <tr>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
         </tr>
         <tr>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_4">removed the SVG</span></td>
            <td><span class="MathJax_SVG" id="MathJax_Element_Frame_5">removed the SVG</span></td>
         </tr>
      </table>
      </p>
   </body>
</html>

Update

I have heard that lxml does not support absolute positions. That may be relevant.

Some test code (assuming you saved the HTML as test.html):

from lxml import html

# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the starting element (the main heading)
start = myHtml.find('.//h1[@id=\'mainHeading\']')

print('My start:', html.tostring(start, encoding='unicode'))

# Get next math equation
nextXPath = 'following::.//span[@class=\'MathJax_SVG\']'
nextElem = start.xpath(nextXPath)

if len(nextElem) > 0:
    print('Next equation:', html.tostring(nextElem[0], encoding='unicode'))
else:
    print('No next equation...')

Answer 1

Do you actually need to walk the document? You can also search directly for span elements with the MathJax_SVG class:

from lxml import etree
doc = etree.parse(open("test-document.html")).getroot()
maths = doc.xpath("//span[@class='MathJax_SVG']")
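Since that query returns the spans in document order, "the next equation" can be an index lookup in the result list. A minimal sketch, assuming the current element is itself one of the matched spans; the helper name next_math_by_index and the inline SAMPLE document are illustrative and not part of the original answer:

```python
from lxml import html

# Inline stand-in for the test document (an assumption for this sketch).
SAMPLE = """
<html><body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">eq 1</span></p>
<table>
<tr><td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">eq 2</span></td></tr>
<tr><td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">eq 3</span></td></tr>
</table>
</body></html>
"""

def next_math_by_index(doc, current):
    """Return the equation span after `current`, or None (hypothetical helper)."""
    # All equation spans, in document order.
    maths = doc.xpath("//span[@class='MathJax_SVG']")
    try:
        position = maths.index(current)
    except ValueError:
        # `current` is not one of the equation spans.
        return None
    if position + 1 < len(maths):
        return maths[position + 1]
    return None

doc = html.fromstring(SAMPLE)
first = doc.find(".//span[@id='MathJax_Element_Frame_1']")
following = next_math_by_index(doc, first)
print(following.get("id") if following is not None else "No next equation...")
```

This only works when the starting point is an equation span. Starting from an arbitrary element such as the heading needs a different approach.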

Answer 2

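I ended up writing my own function to get what I wanted. I call it getNext(elem, xpathString). If there is a more efficient way to do this, I'm all ears; I'm not confident about its performance.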

from lxml import html

def getNext(elem, xpathString):
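    '''
    Gets the next element matching the XPath. The XPath is evaluated
    relative to elem first, then relative to its following siblings,
    then relative to the following siblings of each ancestor in turn.
    '''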
    myElem = elem
    nextElem = elem.find(xpathString)

    while nextElem is None:

        if myElem.getnext() is not None:
            myElem = myElem.getnext()
            nextElem = myElem.find(xpathString)

        else:
            if myElem.getparent() is not None:
                myElem = myElem.getparent()
            else:
                break

    return nextElem


# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//span[@id=\'MathJax_Element_Frame_1\']')

print('My start:', html.tostring(start, encoding='unicode'))

# Get next math equation
nextXPath = './/span[@class=\'MathJax_SVG\']'
nextElem = getNext(start, nextXPath)

if nextElem is not None:
    print('Next equation:', html.tostring(nextElem, encoding='unicode'))
else:
    print('No next equation...')
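A possible alternative to walking siblings and parents by hand is the XPath following axis, written without the ".//" that appears in the question. The following axis selects every node after the context node in document order, whatever its ancestors are, and excludes the context node's own descendants. A minimal sketch under those assumptions; the helper name next_math_following and the inline SAMPLE document are illustrative, and the predicate only matches elements whose class attribute is exactly "MathJax_SVG":

```python
from lxml import html

# Inline stand-in for the test document (an assumption for this sketch).
SAMPLE = """
<html><body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">eq 1</span></p>
<table>
<tr><td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">eq 2</span></td></tr>
<tr><td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">eq 3</span></td></tr>
</table>
</body></html>
"""

# [1] keeps only the nearest match, because following is a forward axis.
NEXT_MATH = "following::span[@class='MathJax_SVG'][1]"

def next_math_following(elem):
    """Return the first equation span after `elem`, or None (hypothetical helper)."""
    matches = elem.xpath(NEXT_MATH)
    return matches[0] if matches else None

doc = html.fromstring(SAMPLE)
current = doc.find(".//h1[@id='mainHeading']")

# Visit every equation in reading order, starting from the heading.
while True:
    current = next_math_following(current)
    if current is None:
        break
    print("Next equation:", current.get("id"))
```

Because the axis does the traversal inside libxml2, there is no Python-level loop over siblings and parents. Whether that is actually faster on a given document would need to be measured.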

Get following node in a different ancestor using lxml and XPath
