Question

I have a page whose HTML structure looks like this:

<p>
    <strong style="mso-bidi-font-weight: normal;">
        <span>
            Text
        </span>
    </strong>
    <span>
        Text
    </span>
    <span>
        Text
        <em>
            Text
        </em>
        Text
    </span>
    <span>
        Text
        <strong>
            Text
        </strong>
    </span>
</p>

I want to extract the text from each p tag. The results must be separated by newlines:

first p tag:   TextTextText
next p tag:    TextTextText

What I did:

url="http://www.toponymic-dictionary.in.ua/index.php?option=com_content&view=section&layout=blog&id="+str(i)+"&Itemid="+str(i+1)
page = urllib.urlopen(url)
pageWritten = page.read()
pageReady = pageWritten.decode('utf-8')
xmldata = lxml.html.document_fromstring(pageReady)
for element in xmldata.xpath('//p[@class="MsoNormal"]'):
    joined_text=u''.join(element.xpath('descendant::text()'))
print joined_text

But it only prints the last p, and I don't understand why. I'd appreciate any help.
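
A likely cause, judging from the snippet above: `print joined_text` sits outside the `for` loop, so `joined_text` is overwritten on every iteration and only the final value is printed. Below is a minimal sketch of one possible fix, which collects each paragraph's text inside the loop. It assumes Python 3 syntax (the snippet above is Python 2) and uses an inline `SAMPLE_HTML` string in place of the downloaded page. The `extract_paragraphs` helper and the whitespace collapsing are illustrative additions, not part of the original code.

```python
import lxml.html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = u"""
<html><body>
<p class="MsoNormal"><strong><span>First</span></strong><span> one</span></p>
<p class="MsoNormal"><span>Second <em>two</em></span></p>
</body></html>
"""


def extract_paragraphs(html_text):
    """Return one string per <p class="MsoNormal">, joining all descendant text."""
    doc = lxml.html.document_fromstring(html_text)
    lines = []
    for element in doc.xpath('//p[@class="MsoNormal"]'):
        joined_text = u''.join(element.xpath('descendant::text()'))
        # Assumption: collapse whitespace left over from markup indentation.
        # The key point is that the result is stored INSIDE the loop.
        lines.append(u' '.join(joined_text.split()))
    return lines


paragraphs = extract_paragraphs(SAMPLE_HTML)
# One paragraph per line.
print(u'\n'.join(paragraphs))
```

If only printing is needed, moving the `print` call into the loop body (indented to the same level as the `joined_text` assignment) should have the same effect. Returning a list just makes the result easier to reuse.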

Extracting text with XPath in Python

0 answers: