How do I get the inner HTML content with this XPath expression?

Asked: 2017-05-21 17:03:55

Tags: xpath html-parsing

I have some HTML code:

<li><h3>Number Theory - Even Factors</h3>
    <p lang="title">Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?</p>
    <ol class="xyz">
        <li>1183</li>
        <li>1200</li>
        <li>1050</li>
        <li>840</li>
    </ol>
    <ul class="exp">
        <li class="grey fleft">
            <span class="qlabs_tooltip_bottom qlabs_tooltip_style_33" style="cursor:pointer;">
            <span>
                <strong>Correct Answer</strong>
                    Choice (A).</br>1183
                </span> 
                Correct answer
            </span>
        </li>
        <li class="primary fleft">
            <a href="factors_6.shtml">Explanatory Answer</a>
        </li>
        <li class="grey1 fleft">Factors - Even numbers</li>
        <li class="orange flrt">Medium</li>
    </ul>       
</li>

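In the HTML snippet above, I am trying to extract the contents of <p lang="title">. Notice that it has <sup></sup> and <sub></sub> tags inside it.

My XPath expression .//p[@lang="title"]/text() does not retrieve the sub and sup content. How can I get the output shown below?

Desired output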

Number N = 2<sup>6</sup>*5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?

1 answer:

Answer 0 (score: 0)

XPath

You can get the inner HTML by selecting node(), as follows:

//p[@lang="title"]/node()

Note that this returns a list of nodes (text nodes and elements), not a single string.

Python

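You can use the following Python code to get the required innerHTML:

from bs4 import BeautifulSoup  # BeautifulSoup 4, which provides decode_contents()

def innerHTML(element):
    """Return the inner HTML of a BeautifulSoup element as a string."""
    return element.decode_contents(formatter="html")

html = """<html>
               <head>...
               <body>...
               Your HTML source code
               ..."""

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
# Find the first <p> whose lang attribute is "title"
paragraph = soup.find('p', {"lang": "title"})

print(innerHTML(paragraph))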

Output:

Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?
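
If installing a third-party parser is not an option, the same idea can be sketched with the standard library alone. This is only an illustration under a strong assumption: the fragment must be well-formed XML, which real-world HTML often is not (the stray </br> in the question's markup would make xml.etree.ElementTree reject the full snippet). The names `fragment` and `inner_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: a well-formed excerpt of the question's markup.
fragment = (
    '<li><p lang="title">Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup>'
    ' * 10<sup>7</sup>; how many factors of N are even numbers?</p></li>'
)

root = ET.fromstring(fragment)
# ElementTree supports only a subset of XPath; attribute predicates are included.
p = root.find(".//p[@lang='title']")


def inner_xml(element):
    """Concatenate the element's leading text and its serialized children."""
    parts = [escape(element.text or "")]
    for child in element:
        # ET.tostring() includes the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


print(inner_xml(p))
```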