python memoryerror - big loop of xml into mongodb

Time: 2018-04-03 06:41:07

Tags: python xml mongodb ubuntu digital-ocean

I downloaded a zip file from https://clinicaltrials.gov/AllPublicXML.zip that contains more than 200k XML files (most under 10 kB) into a directory (see 'dirpath_zip' in the CODE) that I created on Ubuntu 16.04 (on DigitalOcean). What I want to accomplish is to load all of them into MongoDB (also installed in the same location as the zip file).

I have run the CODE below twice, and it consistently fails while processing the 15988th file.

I have googled and tried to read other posts about this particular error, but could not find a solution that addresses this specific problem. Actually, I am not quite sure what the problem really is... any help is much appreciated!!

CODE:

import re
import sys  # needed for sys.exit() in timestamper()
import json
import zipfile
import pymongo
import datetime
import xmltodict
from bs4 import BeautifulSoup
from pprint import pprint as ppt


def timestamper(stamp_type="regular"):
    if stamp_type == "regular":
        timestamp = str(datetime.datetime.now())
    elif stamp_type == "filename":
        timestamp = str(datetime.datetime.now()).replace("-", "").replace(":", "").replace(" ", "_")[:15]
    else:
        sys.exit("ERROR [timestamper()]: unexpected 'stamp_type' (parameter) encountered")
    return timestamp


client = pymongo.MongoClient()
db = client['ctgov']
coll_name = "ts_"+timestamper(stamp_type="filename")
coll = db[coll_name]

dirpath_zip = '/glbdat/ctgov/all/alltrials_20180402.zip'
z = zipfile.ZipFile(dirpath_zip, 'r')
i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        soup = BeautifulSoup(z.read(xmlfile), 'lxml')

        json_study = json.loads(re.sub('\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())

        coll.insert_one(json_study)

        i+=1

ERROR MESSAGE:

Traceback (most recent call last):
  File "zip_to_mongo_alltrials.py", line 38, in <module>
    soup = BeautifulSoup(z.read(xmlfile), 'lxml')
  File "/usr/local/lib/python3.5/dist-packages/bs4/__init__.py", line 225, in __init__
    markup, from_encoding, exclude_encodings=exclude_encodings)):
  File "/usr/local/lib/python3.5/dist-packages/bs4/builder/_lxml.py", line 118, in prepare_markup
    for encoding in detector.encodings:
  File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 264, in encodings
    self.chardet_encoding = chardet_dammit(self.markup)
  File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 34, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 30, in detect
    u.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 128, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "/usr/lib/python3/dist-packages/chardet/charsetgroupprober.py", line 64, in feed
    st = prober.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/hebrewprober.py", line 224, in feed
    aBuf = self.filter_high_bit_only(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/charsetprober.py", line 53, in filter_high_bit_only
    aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
  File "/usr/lib/python3.5/re.py", line 182, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError

1 answer:

Answer 0 (score: 0):

Try reading from the file and inserting into the db in a separate function. Also add gc.collect() for garbage collection.

    import gc

    i = 0

    def read_xml_insert(xmlfile):
        # Parse one XML member of the archive and insert it into the collection.
        soup = BeautifulSoup(z.read(xmlfile), 'lxml')
        json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
        coll.insert_one(json_study)

    for xmlfile in z.namelist():
        print(i, 'parsing:', xmlfile)
        if xmlfile == 'Contents.txt':
            print(xmlfile, '==> entering "continue"')
            continue
        else:
            read_xml_insert(xmlfile)
            i += 1
        gc.collect()  # free memory held by objects from the previous iteration
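
The traceback shows the MemoryError being raised inside chardet's encoding detection, which BeautifulSoup runs on each file because no encoding is supplied, so garbage collection alone may not be enough. As a minimal sketch of an alternative, assuming every XML file in the archive is UTF-8 encoded (the ClinicalTrials.gov exports declare that encoding) and reusing the 'z' and 'coll' objects from the question, one could decode the bytes explicitly and feed them straight to xmltodict; the helper name 'parse_and_insert' is illustrative, not part of the original answer.

    import json
    import xmltodict

    def parse_and_insert(z, xmlfile, coll):
        # Decode explicitly so BeautifulSoup/chardet never has to guess the encoding
        # (assumption: every member of the archive is UTF-8 encoded XML).
        xml_text = z.read(xmlfile).decode('utf-8')
        doc = xmltodict.parse(xml_text)
        study = doc.get('clinical_study')  # root element of each trial record
        if study is not None:
            # Round-trip through JSON to turn nested OrderedDicts into plain dicts.
            coll.insert_one(json.loads(json.dumps(study)))

This skips building the BeautifulSoup tree and the large intermediate buffers that chardet scans, which is where the traceback shows memory running out.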



