XML parsing returns HTML: how do I get its text in Python?

Date: 2015-08-15 20:16:47

Tags: python xml python-2.7 xml-parsing html-parsing

I am parsing an XBRL file with minidom. Using getElementsByTagName, I found the following:
<table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse"  width="100%"><tr><td colspan="1">Independent auditor's report on the financial statements</td></tr></table><br><table xmlns="http://www.w3.org/1999/xhtml" style="border-right: 0px; border-top: 0px; border-left: 0px; width: 650px; border-bottom: 0px; border-collapse: collapse"  width="100%"><tr><td colspan="1">We have audited the financial statements of KPMG Statsautoriseret Revisionspartnerselskab for the financial year 11 December 2013 – 31 December 2014. The financial statements comprise income statement, balance sheet, statement of changes in equity, cash flow statement accounting policies and notes. The financial statements are prepared in accordance with the Danish Financial Statements Act.</td></tr></table>

Now I want to get only the text. How should I do that? Can I use Beautiful Soup from this point on?

The whole file can be found here; the field I am looking at is <arr:AuditorsReportOnFinancialStatements

1 Answer:

Answer 0: (score: 0)

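    from bs4 import BeautifulSoup

    # The element's first child is a text node holding the embedded HTML markup.
    soup = BeautifulSoup(auditorsReportOnAuditedFS[0].firstChild.data, 'html.parser')
    items = soup.find_all('td')
    listForString = []
    for item in items:
        # Python 2.7: encode each cell's text to a UTF-8 byte string.
        listForString.append(item.text.encode('utf-8').strip())
    # `result` is assumed to be a list defined earlier in the script.
    result.append(' : '.join(['AuditorsReportOnFinancialStatements', ' - '.join(listForString)]))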

This works.
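
For anyone who cannot install Beautiful Soup, here is a rough standard-library sketch of the same idea for Python 3. It swaps Beautiful Soup for `html.parser.HTMLParser`, so it is an alternative technique, not the code from the answer. It assumes, as the use of `firstChild.data` suggests, that the HTML is stored as escaped text inside the element. `SAMPLE_XBRL`, the `http://example.com/arr` namespace URI, `CellTextParser` and `auditors_report_text` are all made up for illustration; the real file is much larger.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Hypothetical, heavily shortened stand-in for the XBRL file. The HTML is
# escaped, so minidom hands it back as plain text rather than as child elements.
SAMPLE_XBRL = (
    "<xbrl xmlns:arr='http://example.com/arr'>"
    "<arr:AuditorsReportOnFinancialStatements>"
    "&lt;table&gt;&lt;tr&gt;&lt;td&gt;Independent auditor's report on the "
    "financial statements&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;br&gt;"
    "&lt;table&gt;&lt;tr&gt;&lt;td&gt;We have audited the financial "
    "statements.&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;"
    "</arr:AuditorsReportOnFinancialStatements>"
    "</xbrl>"
)


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []    # text of each finished cell
        self._depth = 0    # greater than 0 while inside a <td>
        self._parts = []   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def auditors_report_text(xml_source):
    """Return the cell texts of the first matching element, joined by ' - '."""
    dom = minidom.parseString(xml_source)
    nodes = dom.getElementsByTagName("arr:AuditorsReportOnFinancialStatements")
    # Join every text/CDATA child in case the markup is split across nodes.
    markup = "".join(
        child.data
        for child in nodes[0].childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )
    parser = CellTextParser()
    parser.feed(markup)
    parser.close()
    return " - ".join(parser.cells)


print(auditors_report_text(SAMPLE_XBRL))
```

On Python 3 there is no need for the `.encode('utf-8')` step, since the strings are already Unicode. If the element contained real XHTML child elements instead of escaped text, this approach would not apply and you would walk the minidom nodes directly.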