Writing large (300+ MB) XML with lxml in Python 3

Asked: 2018-03-05 08:46:59

Tags: python xml sqlalchemy lxml

I've been googling for the past few days, but I simply can't find any remotely similar question :(

My script in Python 3 has a simple goal:

  1. Connect to a MySQL database and fetch the data
  2. Create an XML document using lxml
  3. Save that XML to a file

Usually I have no problem with XML files containing 5000+ elements, but in this case my VPS (an Amazon EC2 micro instance) hits its memory limit. My code (the core part):

    from datetime import datetime

    from lxml import etree
    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # config() (e.g. from python-decouple), the Trips and TripsImages
    # models, and target_xml are defined elsewhere in the project
    engine = create_engine(config('DB_URI'))
    Session = sessionmaker(bind=engine)
    session = Session()
    
    query = session.query(Trips.Country,
                          Trips.Region,
                          Trips.Name,
                          Trips.Rebate,
                          Trips.Stars,
                          Trips.PromotionName,
                          Trips.ProductURL,
                          Trips.SubProductURL,
                          Trips.Date,
                          Trips.City,
                          Trips.Type,
                          Trips.Price,
                          TripsImages.ImageURL) \
        .join(TripsImages) \
        .all()
    
    # define namespace xmlns:g
    XMLNS = "{http://base.google.com/ns/1.0}"
    NSMAP = {"g": "http://base.google.com/ns/1.0"}
    
    # create root rss and channel
    rss = etree.Element("rss", nsmap=NSMAP, attrib={"version": "2.0"})
    channel = etree.SubElement(rss, "channel", attrib={"generated": str(datetime.now())})
    
    # add <channel> title and description
    channel_title = etree.SubElement(channel, "title")
    channel_link = etree.SubElement(channel, "link")
    channel_description = etree.SubElement(channel, "description")
    
    channel_title.text = "Trips"
    channel_link.text = "https://example.com"
    channel_description.text = "Description"
    
    # generate xml elements
    for count, elem in enumerate(query):
        item = etree.SubElement(channel, "item")
    
        url = "/".join(["https://example.com",
                        elem.ProductURL,
                        elem.SubProductURL,
                        elem.Date.strftime('%Y%m%d')
                        ])
        price_discounted = round(elem.Price - elem.Price * (elem.Rebate / 100))
    
        etree.SubElement(item, XMLNS + "id").text = str(count)
        etree.SubElement(item, XMLNS + "title").text = elem.Country
        etree.SubElement(item, XMLNS + "description").text = elem.Product
        etree.SubElement(item, XMLNS + "link").text = url
        etree.SubElement(item, XMLNS + "image_link").text = elem.ImageURL
        etree.SubElement(item, XMLNS + "condition").text = "new"
        etree.SubElement(item, XMLNS + "availability").text = "in stock"
        etree.SubElement(item, XMLNS + "price").text = str(elem.Price)
        etree.SubElement(item, XMLNS + "sale_price").text = str(price_discounted)
        etree.SubElement(item, XMLNS + "brand").text = "Brand"
        etree.SubElement(item, XMLNS + "additional_image_link").text = elem.ImageURL
        etree.SubElement(item, XMLNS + "custom_label_0").text = elem.Date.strftime("%Y-%m-%d")
        etree.SubElement(item, XMLNS + "custom_label_1").text = elem.Type
        etree.SubElement(item, XMLNS + "custom_label_2").text = str(elem.Stars / 10)
        etree.SubElement(item, XMLNS + "custom_label_3").text = elem.City
        etree.SubElement(item, XMLNS + "custom_label_4").text = elem.Country
        etree.SubElement(item, XMLNS + "custom_label_5").text = elem.PromotionName
    
    
    # finally, serialize XML and save as file
    with open(target_xml, "wb") as file:
        file.write(etree.tostring(rss, encoding="utf-8", pretty_print=True))
    

I'm using SQLAlchemy to query the database and lxml to generate the XML file. Fetching the data from the DB already builds a list of 228,890 rows, which uses a lot of memory. Building the XML then creates more objects in memory, bringing total usage to about 1.5 GB of RAM.

This code works fine on my laptop with 8 GB of RAM, but when it runs on an Amazon EC2 instance with 1 GB of RAM and 1 GB of swap, the process reaches the write() call and gets "Killed" by Linux.

There is plenty on Stack Overflow about parsing large XML files, but apart from avoiding multiple I/O operations I couldn't find anything about writing large files in Python :(

1 Answer:

Answer 0 (score: 0)

I think what you need is the yield_per() function, so that instead of handling all the results at once you process them in chunks. This saves a lot of memory. You can read more about the function at this link.
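
As a minimal sketch (reusing the session, Trips and TripsImages names from the question; the column list is abbreviated), this just means dropping the trailing .all() and letting the loop pull rows in batches:

    # Same query as in the question, but without .all(): yield_per(1000)
    # makes SQLAlchemy fetch rows in batches of 1000 as the loop consumes
    # them, instead of materializing all ~229k rows up front.
    query = session.query(Trips.Country,
                          Trips.Name,
                          Trips.Price,
                          TripsImages.ImageURL) \
        .join(TripsImages) \
        .yield_per(1000)

    for count, elem in enumerate(query):
        pass  # build one <item> per row, exactly as in the question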

Be aware, however, that yield_per() may silently skip some of your query's rows; the answer in this question explains why in detail. If, after reading it, you decide against using yield_per(), you can look through all the answers posted in this stackoverflow question.

Another tip when processing large lists is to use yield, so that instead of loading all the entries into memory at once you process them one by one. Hope it helps.
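
The XML tree itself is the other big memory consumer here, so the chunked query can be combined with lxml's incremental serializer, etree.xmlfile, which writes each <item> to disk as soon as it is produced instead of holding the whole document in RAM. A rough sketch, reusing the names from the question (the query without its trailing .all(), XMLNS, NSMAP, target_xml); the per-item field assignments are abbreviated:

    from lxml import etree

    # Stream the feed to disk: only one <item> lives in memory at a time,
    # and yield_per() keeps the database side chunked as well.
    with etree.xmlfile(target_xml, encoding="utf-8") as xf:
        xf.write_declaration()
        with xf.element("rss", nsmap=NSMAP, attrib={"version": "2.0"}):
            with xf.element("channel"):
                title = etree.Element("title")
                title.text = "Trips"
                xf.write(title)

                for count, elem in enumerate(query.yield_per(1000)):
                    item = etree.Element("item")
                    etree.SubElement(item, XMLNS + "id").text = str(count)
                    etree.SubElement(item, XMLNS + "title").text = elem.Country
                    # ... remaining fields exactly as in the question ...
                    xf.write(item)  # serialized immediately, then freed

With this pattern peak memory usage should stay roughly constant no matter how many rows the query returns, so a 300+ MB file can plausibly be produced on a 1 GB instance.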