Python - Large XML - What is the best way to request, parse, and convert items to dicts?

Time: 2016-10-28 01:56:07

Tags: python xml parsing dictionary request

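I found a few threads here about requesting and parsing large XML, but I could not match them to what I need.

I need to fetch a large XML file using requests. It has <products>, and I need to convert each product into a dict and send it to the database with dataset's table.insert(dict).

XML file: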

<?xml version="1.0"?><crossDocking customerId="00000000000" company="xyz" database="08/12/2014 14:16:56" numberResults="118">
    <product>
        <prod_id>18108</prod_id>
        <brand><![CDATA[PHILIPS]]></brand>
        <prod_name><![CDATA[Fone de Ouvido SHP2500/00 para TV com Controle de Volume PHILIPS]]></prod_name>
        <seg_name><![CDATA[Eletrônicos##Fones de Ouvidos##Com Fio]]></seg_name>
        <image><![CDATA[http://static.hayamax.com.br/imgProd/18108_500_001.jpg]]></image>
        <link><![CDATA[http://www.hayamax.com.br/fone-de-ouvido-shp2500-00-para-tv-com-controle-de-volume-philips]]></link>
        <NBM><![CDATA[8518.30.00]]></NBM>
        <saleUnit><![CDATA[PC]]></saleUnit>
        <saleQuant>1</saleQuant>
        <weightValue>0.471</weightValue>
        <weightUnit><![CDATA[KG]]></weightUnit>
        <shortname><![CDATA[FONE PHILIPS SHP2500/00 6MT PTA]]></shortname>
        <EAN>8710895945875</EAN>
        <width>19.900</width>
        <height>24.000</height>
        <depth>10.900</depth>
        <information>
            <description><![CDATA[Este fone de ouvido tem um refletor acústico que melhora o reforço dinâmico de graves para seus momentos de lazer com aparelho de som ou TV.   Toda a orelha é coberta, privilegiando a qualidade de som. Proporciona conforto mesmo no uso prolongado. Possui prático cabo de 6 m, permitindo que você fique onde preferir em sua sala, e controle em linha que simplifica o ajuste de volume.]]></description>
            <characteristics><![CDATA[Tipo de Imã: Ímã em Ferrite Bobina de Voz: Cobre Resposta frequência: 15Hz a 22KHz Impedância: 32Ohms Potência: 500mW Sensibilidade: 100dB Diâmetro falante: 40mm Conector: Conectores P2 3,5 e 6,3mm estéreo cromados Cor: Prata Controle volume: Possui controle de volume no cabo]]></characteristics>
            <technical><![CDATA[Comprimento do cabo: Cabo destacável de 6 metros]]></technical>
            <included><![CDATA[]]></included>
        </information>
        <PPB>0</PPB>
        <warrantyDays>06</warrantyDays>
        <price>42.22</price>
        <stock>0</stock>
        <IPI>0.00</IPI>
        <sourceFat>PR</sourceFat>
    </product>

Should I use ElementTree?

Update

import requests
import xmltodict
import xml.etree.ElementTree as etree
import lxml
from lxml import etree

url = "http://xxxxxxxxx"
response = requests.get(url, stream=True)
print response
xml = etree.parse(response.content)

for product in xml:
    print product ## <Product>

Output:

<Response [200]>
Traceback (most recent call last):
  File "/home/ubuntu/workspace/ex50/bin/hayamax/hayamax.py", line 10, in <module>
    xml = etree.parse(response.content)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79841)
  File "src/lxml/parser.pxi", line 1793, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116175)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:116525)
  File "src/lxml/parser.pxi", line 1723, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:115413)
  File "src/lxml/parser.pxi", line 1126, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:110110)
  File "src/lxml/parser.pxi", line 584, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103584)
  File "src/lxml/parser.pxi", line 694, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105238)
  File "src/lxml/parser.pxi", line 622, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:104104)
IOError

1 answer:

Answer 0: (score: 0)

I'm not sure I fully understand you, but try this. This is how I read this kind of data:

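import urllib  # Python 2; in Python 3 use urllib.request.urlopen
from lxml import etree

response = urllib.urlopen(url)  # url as defined in the question
data = response.read()
tree = etree.fromstring(data)  # parses XML held in memory, returns the root
# Assumes a <products> wrapper; for the sample in the question, where
# <product> sits directly under the root, use 'product' instead.
xml = tree.findall('products/product')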

This assumes the data is one long list of <product> ...many nested XML elements... </product> entries, nested inside <products> ...many products... </products>.
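To make that assumption concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree (lxml's etree offers the same fromstring, parse and findall calls). The SAMPLE bytes are a hypothetical, shortened stand-in for the downloaded body, not the real feed. It also shows the likely cause of the IOError in the question: parse() expects a file name or file-like object, so a string of XML passed to it gets treated as a path.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes in response.content / response.read().
SAMPLE = b"""<?xml version="1.0"?>
<crossDocking company="xyz" numberResults="2">
    <product><prod_id>18108</prod_id><price>42.22</price></product>
    <product><prod_id>18109</prod_id><price>10.00</price></product>
</crossDocking>"""

# fromstring() parses XML that is already in memory and returns the root.
root = ET.fromstring(SAMPLE)

# parse() wants a file name or file-like object, so wrap the bytes first.
root_from_file = ET.parse(io.BytesIO(SAMPLE)).getroot()

# Paths are relative to the root element. Here <product> sits directly
# under <crossDocking>, so there is no <products> level to walk through.
products = root.findall('product')
wrapped = root.findall('products/product')

print(len(products), len(wrapped))  # prints: 2 0
```

If the real feed does wrap its items in <products>, the 'products/product' path from the answer would be the one that matches.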

I think it will do what you want. You can then loop over the inner parts in the same way to do whatever you need.
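As a rough sketch of that loop, the helper below flattens each <product> into a dict, and a generator built on iterparse yields one dict at a time so a large file does not have to be held as a full tree. The names element_to_dict and iter_product_dicts, the underscore-joined keys for nested elements, and the shortened SAMPLE are all assumptions made for illustration; they are not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Flatten an element's children into a dict, joining nested names with '_'."""
    row = {}
    for child in elem:
        if len(child):  # child has its own children, e.g. <information>
            for key, value in element_to_dict(child).items():
                row[child.tag + '_' + key] = value
        else:
            row[child.tag] = (child.text or '').strip()
    return row


def iter_product_dicts(source, tag='product'):
    """Yield one dict per <product> read from a file name or file-like object."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield element_to_dict(elem)
            # Release the product's children; the emptied element itself
            # stays attached to the root.
            elem.clear()


# Hypothetical, shortened sample in the shape of the question's XML.
SAMPLE = """<crossDocking>
    <product>
        <prod_id>18108</prod_id>
        <brand><![CDATA[PHILIPS]]></brand>
        <information>
            <technical><![CDATA[Comprimento do cabo: Cabo destacável de 6 metros]]></technical>
            <included><![CDATA[]]></included>
        </information>
        <price>42.22</price>
    </product>
</crossDocking>"""

rows = list(iter_product_dicts(io.BytesIO(SAMPLE.encode('utf-8'))))
print(rows[0]['prod_id'], rows[0]['information_included'] == '')
```

With requests, the file-like response.raw from requests.get(url, stream=True) could probably be passed as source (setting response.raw.decode_content = True first so a compressed body is decoded), and each yielded dict handed to table.insert(row). All values come out as strings, so numeric fields such as price would need converting before or after the insert.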