How do I parse a remote file?

Time: 2014-02-17 11:33:45

Tags: python python-3.x

Please help me parse a file from the internet.

import pprint
import xml.dom.minidom
from xml.dom.minidom import Node

import requests

addr = requests.get('http://fh79272k.bget.ru/py_test/books.xml')
print(addr.status_code)

doc = xml.dom.minidom.parse(str(addr))          # load doc into object
                                                  # usually parsed up front


mapping = {}
for node in doc.getElementsByTagName("book"):     # traverse DOM object
    isbn = node.getAttribute("isbn")              # via DOM object API
    L = node.getElementsByTagName("title")
    for node2 in L:
        title = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                title += node3.data 
        mapping[isbn] = title

# mapping now has the same value as in the SAX example
pprint.pprint(mapping)

This script does not work. The error message is:

  

Traceback (most recent call last):
  File "C:\VINT\OPENSERVER\OpenServer\domains\localhost\python\parse_html\1\dombook.py", line 14, in <module>
    doc = xml.dom.minidom.parse(str(addr))          # load doc into object
  File "C:\Python33\lib\xml\dom\minidom.py", line 1960, in parse
    return expatbuilder.parse(file)
  File "C:\Python33\lib\xml\dom\expatbuilder.py", line 908, in parse
    fp = open(file, 'rb')
OSError: [Errno 22] Invalid argument: ''

XML:

<catalog>
<book isbn="0-596-00128-2">
<title>Python &amp; XML</title>
<date>December 2001</date>
<author>Jones, Drake</author>
</book>
<book isbn="0-596-15810-6">
<title>Programming Python, 4th Edition</title>
<date>October 2010</date>
<author>Lutz</author>
</book>
<book isbn="0-596-15806-8">
<title>Learning Python, 4th Edition</title>
<date>September 2009</date>
<author>Lutz</author>
</book>
<book isbn="0-596-15808-4">
<title>Python Pocket Reference, 4th Edition</title>
<date>October 2009</date>
<author>Lutz</author>
</book>
<book isbn="0-596-00797-3">
<title>Python Cookbook, 2nd Edition</title>
<date>March 2005</date>
<author>Martelli, Ravenscroft, Ascher</author>
</book>
<book isbn="0-596-10046-9">
<title>Python in a Nutshell, 2nd Edition</title>
<date>July 2006</date>
<author>Martelli</author>
</book>
<!--
 plus many more Python books that should appear here 
-->
</catalog>

1 answer:

Answer 0: (score: 1)

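You are building the XML from the response object, not from the text in its body. Use addr.text instead of str(addr). Note that minidom.parse expects a filename or a file object, so XML held in a string has to go through parseString:

doc = xml.dom.minidom.parseString(addr.text)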

Also, using an XML parser to process HTML is cumbersome. Try Beautiful Soup for that.
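
To try the DOM approach without a network connection, here is a minimal sketch. It assumes the response body is already available as a string. SAMPLE_XML is a shortened stand-in for addr.text, and titles_by_isbn is a hypothetical helper name, not part of the original script. It relies on xml.dom.minidom.parseString, which accepts XML text directly, whereas xml.dom.minidom.parse expects a filename or file object.

```python
import xml.dom.minidom
from xml.dom.minidom import Node

# Stand-in for addr.text: a shortened copy of the books.xml sample (assumption).
SAMPLE_XML = """<catalog>
<book isbn="0-596-00128-2">
<title>Python &amp; XML</title>
</book>
<book isbn="0-596-15810-6">
<title>Programming Python, 4th Edition</title>
</book>
</catalog>"""


def titles_by_isbn(xml_text):
    """Map each book's isbn attribute to the text of its <title> element."""
    # parseString takes the XML itself; parse() would treat the string as a path.
    doc = xml.dom.minidom.parseString(xml_text)
    mapping = {}
    for book in doc.getElementsByTagName("book"):
        isbn = book.getAttribute("isbn")
        for title_node in book.getElementsByTagName("title"):
            # Join all text children so a title split into several nodes still works.
            mapping[isbn] = "".join(
                child.data
                for child in title_node.childNodes
                if child.nodeType == Node.TEXT_NODE
            )
    return mapping


if __name__ == "__main__":
    print(titles_by_isbn(SAMPLE_XML))
```

In the original script, the same function could be called as titles_by_isbn(addr.text) once the request has succeeded.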