Question

I read a large amount of data from an XML file, but only part of it ends up being saved in the database.

The XML is structured like this:

<element id="1" other_attrib="">
<element id="2" other_attrib="">
...
<other_element>
  <elem id="1">
  <elem id="100">
</other_element>
<other_element>
  <elem id...>
</other_element>

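All element tags come before the other_element tags (the XML comes from a third party, so I cannot restructure it).

I have to read all of the element tags first, because they carry the other attributes, but I only save the ones that are referenced by an other_element.

So I opened a shelve with writeback=False and used lxml.iterparse() to parse the XML, storing elements as I go. After adding a lot of elements (I don't know the exact number, but it was in the hundreds of thousands), I get the following error: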

HASH: Out of overflow pages.  Increase page size
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shelve.py", line 133, in __setitem__
    self.dict[key] = f.getvalue()

Basically, this is what I do for each element that iterparse returns:

if elem.tag == 'element':
  # Attributes are read through elem.attrib; indexing elem directly returns child nodes
  shlv[elem.attrib['id']] = {'attrib1': elem.attrib['other_attrib'], 'attrib2': elem.attrib['attrib2']}
elif elem.tag == 'other_element':
  # Iterate through this tag's children and look up each reference in the shlv object
  for ref in elem:
    save_in_database(shlv[ref.attrib['id']])
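
For anyone who wants to reproduce the flow, here is a minimal self-contained sketch of the pattern above. It rests on several assumptions: it uses the standard library's xml.etree.ElementTree.iterparse as a stand-in for lxml (the call looks similar), the sample XML with its <root> wrapper and attribute values is made up, and a plain list replaces save_in_database. It only illustrates the store-then-look-up flow; it does not reproduce or fix the overflow error.

```python
import os
import shelve
import tempfile
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Hypothetical sample input: the <root> wrapper and the values are assumptions.
SAMPLE_XML = """<root>
  <element id="1" other_attrib="a" attrib2="x"/>
  <element id="2" other_attrib="b" attrib2="y"/>
  <element id="100" other_attrib="c" attrib2="z"/>
  <other_element>
    <elem id="1"/>
    <elem id="100"/>
  </other_element>
</root>"""


def collect_referenced(xml_path, shelf_path):
    """Return the stored attribute dicts referenced by <other_element> children."""
    saved = []  # stands in for save_in_database()
    shlv = shelve.open(shelf_path, writeback=False)
    try:
        # "end" events fire once an element and all of its children are parsed.
        for _event, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "element":
                shlv[elem.get("id")] = {
                    "attrib1": elem.get("other_attrib"),
                    "attrib2": elem.get("attrib2"),
                }
                elem.clear()  # drop the parsed data to keep memory flat
            elif elem.tag == "other_element":
                for ref in elem:
                    saved.append(shlv[ref.get("id")])
                elem.clear()
    finally:
        shlv.close()
    return saved


def run_demo():
    """Write the sample XML to a temp directory and run the sketch on it."""
    workdir = tempfile.mkdtemp()
    xml_path = os.path.join(workdir, "sample.xml")
    with open(xml_path, "w") as fh:
        fh.write(SAMPLE_XML)
    return collect_referenced(xml_path, os.path.join(workdir, "elements.shelf"))
```

Calling run_demo() returns only the entries for ids 1 and 100, since element 2 is stored in the shelf but never referenced.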

What can I change so that shelve handles more data? Or should I use something else to store this data?

Shelve: out of overflow pages

0 answers: