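I am writing a text-to-speech program that reads math equations aloud. I have a thread that needs to pull out math equations (as MathJax SVGs) and parse them into prose.
Because of the way the content is laid out, the equations can be nested arbitrarily deep inside other elements such as paragraphs, bold text, tables, and so on.
Given a reference to the current element, how do I get the next <span class="MathJax_SVG">, which may be embedded inside other parents/ancestors?
I tried to solve it with the following: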
nextMath = currentElement.xpath('following::.//span[@class=\'MathJax_SVG\']')
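This returns nothing, even though I can visually confirm that there is something after it. I tried removing the period, but lxml complained that my XPath was malformed.
Has anyone run into this before?
P.S. Here is a test document to illustrate my point: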
<html>
<head>
<title>Test Document</title>
</head>
<body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p>The quadratic formula is used to solve quadratic equations. Here is the formula:</p>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
<p>Here are some possible values when you use the formula:</p>
<p>
<table>
<tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
</tr>
<tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_4">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_5">removed the SVG</span></td>
</tr>
</table>
</p>
</body>
</html>
Update
Reportedly, lxml does not support absolute positions. This may be relevant.
Some test code (assuming you saved the HTML as test.html):
from lxml import html

# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the starting element (the main heading)
start = myHtml.find('.//h1[@id=\'mainHeading\']')
print('My start:', html.tostring(start, encoding='unicode'))

# Get next math equation (this is the expression that finds nothing)
nextXPath = 'following::.//span[@class=\'MathJax_SVG\']'
nextElem = start.xpath(nextXPath)
if nextElem:
    print('Next equation:', html.tostring(nextElem[0], encoding='unicode'))
else:
    print('No next equation...')
Answer 0 (score: 0)
Do you need to traverse the document at all? You can also search directly for span elements with the MathJax_SVG class:
from lxml import html
# Parse test.html with the HTML parser, then collect every MathJax span
doc = html.parse("test.html").getroot()
maths = doc.xpath("//span[@class='MathJax_SVG']")
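If the current element is itself one of the MathJax spans, the list returned above can double as a lookup for "the next equation", since lxml returns XPath node sets in document order. The following is only a sketch under that assumption: `SAMPLE` is a shortened inline copy of the question's test document, and `next_in_list` is a hypothetical helper name, not part of lxml.

```python
from lxml import html

# Assumption: a shortened inline copy of the test document from the question.
SAMPLE = """
<html><body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
<table><tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
</tr></table>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Every MathJax span, however deeply nested, in document order.
maths = doc.xpath("//span[@class='MathJax_SVG']")
ids = [span.get("id") for span in maths]


def next_in_list(current, candidates):
    """Hypothetical helper: return the item after `current`, or None at the end.

    Only works when `current` is itself a member of `candidates`.
    """
    idx = candidates.index(current)
    return candidates[idx + 1] if idx + 1 < len(candidates) else None


following = next_in_list(maths[0], maths)
print(ids, following.get("id"))
```

This does not help when the starting point is an arbitrary element such as the heading, because that element is not in the list.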
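Answer 1 (score: 0)
I ended up writing my own function to get what I wanted. I call it getNext(elem, xpathString). If there is a more efficient way to do this, I'm all ears. I am not confident about its performance.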
from lxml import html

def getNext(elem, xpathString):
    '''
    Gets the next element defined by XPath. Searches with
    elem.find(xpathString) in elem first, then in each following
    sibling, climbing to the parent when the siblings run out.
    Returns None if nothing matches.
    '''
    myElem = elem
    nextElem = elem.find(xpathString)
    while nextElem is None:
        if myElem.getnext() is not None:
            # Move to the next sibling and search inside it
            myElem = myElem.getnext()
            nextElem = myElem.find(xpathString)
        else:
            # No siblings left: climb one level, or stop at the root
            if myElem.getparent() is not None:
                myElem = myElem.getparent()
            else:
                break
    return nextElem

# Get my html element
with open('test.html', 'r') as f:
    myHtml = html.fromstring(f.read())

# Get the first MathJax element
start = myHtml.find('.//span[@id=\'MathJax_Element_Frame_1\']')
print('My start:', html.tostring(start, encoding='unicode'))

# Get next math equation
nextXPath = './/span[@class=\'MathJax_SVG\']'
nextElem = getNext(start, nextXPath)
if nextElem is not None:
    print('Next equation:', html.tostring(nextElem, encoding='unicode'))
else:
    print('No next equation...')
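For comparison, the XPath `following` axis can express the same walk in a single expression: the axis is followed directly by the node test (`following::span[...]`), with no `.//` in between, and a trailing `[1]` keeps only the first match in document order. This is a sketch rather than a drop-in replacement: `SAMPLE` is a shortened inline copy of the test document, and `next_math` is a hypothetical helper name. Unlike getNext, the `following` axis never returns descendants of the starting element.

```python
from lxml import html

# Assumption: a shortened inline copy of the test document from the question.
SAMPLE = """
<html><body>
<h1 id="mainHeading">The Quadratic Formula</h1>
<p><span class="MathJax_SVG" id="MathJax_Element_Frame_1">removed the SVG</span></p>
<table><tr>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_2">removed the SVG</span></td>
<td><span class="MathJax_SVG" id="MathJax_Element_Frame_3">removed the SVG</span></td>
</tr></table>
</body></html>
"""

# First MathJax span after the context node, at any nesting depth.
NEXT_MATH = "following::span[@class='MathJax_SVG'][1]"


def next_math(elem):
    """Hypothetical helper: next MathJax span after `elem`, or None."""
    matches = elem.xpath(NEXT_MATH)
    return matches[0] if matches else None


doc = html.fromstring(SAMPLE)
heading = doc.find(".//h1[@id='mainHeading']")

first = next_math(heading)   # span inside the paragraph
second = next_math(first)    # span inside the first table cell
third = next_math(second)    # span inside the second table cell
print(first.get("id"), second.get("id"), third.get("id"))
```

Calling `next_math` on the last span returns None, which gives the reading thread a natural stopping condition.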