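I have a lot of XML files in S3 that I am trying to load into pandas and, from there, into MSSQL. In this example I am using etree to parse the files, but honestly I don't care whether it is lxml or any other package. However, I don't think anything is actually being read from the files. My code and the error are below. I feel like I'm close, but I may well not be! Cheers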
import boto3
from pprint import pprint
import lxml
from lxml import etree
import xml.etree.ElementTree as et
import pandas as pd

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(
    Bucket='MYBUCKET',
    Prefix='FOLDER/FOLDER2/')

bucket_object_list = []
for page in result:
    pprint(page)
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            pprint(keyString)
            bucket_object_list.append(keyString)

s3 = boto3.resource('s3')
for file_name in bucket_object_list:
    obj = s3.Object('MYBUCKET', file_name)
    print(obj.get())
    xmldata = obj.get()["Body"].read().decode('utf-8')
    parsed_xml = et.parse(xmldata)
    dfcols = ['col1','col2', 'col3']
    df_xml = pd.DataFrame(columns=dfcols)
    for node in parsed_xml.getroot():
        col1 = node.find('col1')
        col2 = node.find('col2')
        col3 = node.find('col3')
        df_xml = df_xml.append(
            pd.Series([getvalueofnode(col1), getvalueofnode(col2), getvalueofnode(col3)], index=dfcols),
            ignore_index=True)
    print(df_xml)
Traceback (most recent call last):
  File "xmlnightmare.py", line 31, in <module>
    parsed_xml = et.parse(xmldata)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 647, in parse
    source = open(source, "rb")
IOError: [Errno 2] No such file or directory: u'<?xml version="1.0"?
Answer 0 (score: 0)
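..... IOError: [Errno 2] No such file or directory: u'
ElementTree.parse() expects its argument to be the location of an XML document (a file name or a file object). You are passing it the content of the XML document instead, which is why the document text shows up as the "file name" in the error message above. When you already have the content as a string, use ElementTree.fromstring() instead: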
....
parsed_xml = et.fromstring(xmldata)
dfcols = ['col1','col2', 'col3']
....
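One detail to watch when making this switch: fromstring() returns the root Element itself rather than an ElementTree, so the later loop over parsed_xml.getroot() would need to iterate over parsed_xml directly. The following is only a minimal sketch of that idea, not the asker's actual pipeline. The sample XML, the `rows`/`row` tag names, and the helpers `get_value_of_node` and `rows_from_xml` are hypothetical stand-ins (the original getvalueofnode is not shown in the question), and the inline bytes take the place of what obj.get()["Body"].read() would return from S3.

```python
import io
import xml.etree.ElementTree as et

# Hypothetical sample standing in for obj.get()["Body"].read() from S3.
xml_bytes = b"""<?xml version="1.0"?>
<rows>
  <row><col1>a</col1><col2>b</col2><col3>c</col3></row>
  <row><col1>d</col1><col2>e</col2></row>
</rows>"""

dfcols = ['col1', 'col2', 'col3']


def get_value_of_node(node):
    """Return the element's text, or None when the element is missing."""
    return node.text if node is not None else None


def rows_from_xml(xml_content):
    """Turn each child of the root element into a dict keyed by dfcols."""
    # fromstring() parses XML *content* and returns the root Element
    # directly, so there is no getroot() call here.
    root = et.fromstring(xml_content)
    return [
        {col: get_value_of_node(node.find(col)) for col in dfcols}
        for node in root
    ]


rows = rows_from_xml(xml_bytes)
print(rows)

# Alternative: parse() also works when it is handed a file-like object
# wrapping the content, instead of a path on disk.
root_via_parse = et.parse(io.BytesIO(xml_bytes)).getroot()
print(root_via_parse.tag)
```

If pandas is available, passing the collected list to pd.DataFrame(rows, columns=dfcols) should build the frame in one step, which avoids appending a Series per row inside the loop.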