ElementTree.iterparse with streamed, cached requests throws ParseError

Date: 2016-07-11 03:36:17

Tags: python caching streaming python-requests elementtree

I have a Flask application that retrieves an XML document from a URL and processes it. I'm using requests_cache with a redis backend to avoid extra requests, and ElementTree.iterparse to iterate over the streamed content. Here is a sample of my code (the development server and the interactive interpreter both give the same result):

>>> import requests, requests_cache
>>> import xml.etree.ElementTree as ET
>>> requests_cache.install_cache('test', backend='redis', expire_after=300)
>>> url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
>>> response = requests.get(url, stream=True)
>>> for event, node in ET.iterparse(response.raw):
...     print(node.tag)

Running the code above throws a ParseError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1301, in __next__
    self._root = self._parser._close_and_return_root()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1236, in _close_and_return_root
    root = self._parser.close()
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

However, running the exact same code again before the cache expires actually prints the expected result! Why does the XML parsing fail only the first time, and how can I fix it?

Edit: In case it's useful, I noticed that running the same code without the cache results in a different ParseError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1289, in __next__
    for event in self._parser.read_events():
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1272, in read_events
    raise event
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1230, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

1 answer:

Answer 0 (score: 0)

I can tell you why it fails in both cases: the latter is because the data is gzipped the first time you read from raw; whatever happens when you read it the second time, it comes back already decompressed:

If you print the lines:

for line in response.raw:
    print(line)

you see:

�=V���H�������mqn˫+i�������UȣT����F,�-§�ߓ+���G�o~�����7�C�M{�3D����೺C����ݣ�i�����SD�݌.N�&�HF�I�֎�9���J�ķ����s�*H�@$p�o���Ĕ�Y��v�����8}I,��`�cy�����gE�� �!��B�  &|(^���jo�?�^,���H���^~p��a���׫��j�

����a۱Yk<qba�RN6�����l�/�W����{/��߸�G

X�LxH��哫 .���g(�MQ ����Y�q��:&��>s�M�d4�v|��ܓ��k��A17�

Then, decompressing it:

import zlib

def decomp(raw):
    # zlib.MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for line in raw:
        yield decompressor.decompress(line)

for line in decomp(response.raw):
    print(line)

you see that the decompression works:

<?xml version="1.0" encoding="UTF-8"?>
<myanimelist><myinfo><user_id>4731313</user_id><user_name>Doomcat55</user_name><user_watching>3</user_watching><user_completed>120</user_completed><user_onhold>8</user_onhold><user_dropped>41</user_dropped><user_plantowatch>2</user_plantowatch><user_days_spent_watching>27.83</user_days_spent_watching></myinfo><anime><series_animedb_id>64</series_animedb_id><series_title>Rozen Maiden</series_title><series_synonyms>; Rozen Maiden</series_synonyms><series_type>1</series_type><series_episodes>12</series_episodes><series_status>2</series_status><series_start>2004-10-08</series_start><series_end>2004-12-24</series_end><series_image>http://cdn.myanimelist.net/images/anime/2/15728.jpg</series_image>
..................................

Now, once the response has been cached, if we read a few bytes:

response.raw.read(39)

you see we get decompressed data:

<?xml version="1.0" encoding="UTF-8"?>

Forgetting the cache and passing response.raw to iterparse as-is errors with:

    raise e
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

because it cannot handle the gzipped data.
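If you want that first, uncached run to work with ElementTree, one option (a minimal sketch, with no cache installed, relying on the decode_content attribute urllib3 exposes on response.raw) is to have the raw stream gunzip the body as it is read:

import requests
import xml.etree.ElementTree as ET

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url, stream=True)

# Tell urllib3 to decompress the body transparently as it is read,
# so iterparse sees plain XML instead of gzipped bytes.
response.raw.decode_content = True

for event, node in ET.iterparse(response.raw):
    print(node.tag)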

Also, on the first run, while the response is being cached, using the following:

for line in response.raw:
    print(line)

gives me:

    ValueError: I/O operation on closed file.

That's because the cache has already consumed the data, so there is in fact nothing left to read. I'm not sure using raw together with the cache is really viable at all, since the data gets consumed and the file handle closed.
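If you can give up true streaming (these documents are small), a sketch that behaves the same on the first run and on cached runs is to let requests consume the body and parse it from memory; response.content is already decompressed, and requests_cache stores and replays it transparently:

import io
import requests, requests_cache
import xml.etree.ElementTree as ET

requests_cache.install_cache('test', backend='redis', expire_after=300)

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'
response = requests.get(url)

# response.content is the full, decompressed body on every run,
# whether it came over the wire or out of the redis cache.
for event, node in ET.iterparse(io.BytesIO(response.content)):
    print(node.tag)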

It does work, however, if you use lxml.fromstringlist:

import requests, requests_cache
import lxml.etree as et

requests_cache.install_cache()

def lazy(resp):
    # yield the body in small chunks; iter_content decodes
    # gzip/deflate automatically based on Content-Encoding
    for line in resp.iter_content():
        yield line

url = 'http://myanimelist.net/malappinfo.php?u=doomcat55&status=all&type=anime'

response = requests.get(url, stream=True)

# fromstringlist feeds the chunks to the parser and returns the root element
for node in et.fromstringlist(lazy(response)):
    print(node)
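Note that fromstringlist feeds the chunks to the parser incrementally but still returns the fully built root element, so the loop above prints the root's direct children (the myinfo and anime elements) only after the whole document has been parsed.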