Force an XML file to be saved in XLS format in Python

Time: 2017-07-21 13:33:34

Tags: python xml excel

I have this code, which downloads this fund's data in Excel 2004 XML format:

import urllib2  # Python 2; on Python 3 use urllib.request instead

url = 'https://www.ishares.com/us/258100/fund-download.dl'
contents = urllib2.urlopen(url).read()
# Write the raw bytes unchanged; the with-block closes the file
with open("export.xml", 'wb') as f:
    f.write(contents)

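My goal is to convert this file to .xls programmatically so that I can read it into a pandas DataFrame. I know I could parse the file with Python's xml library. However, I noticed that if I open the xml file and manually save it with the xls file extension, pandas can read it and I get the result I want.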
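For reference, the parsing route mentioned above could look roughly like the sketch below. It assumes the download is well-formed SpreadsheetML (the Excel 2003-era XML layout of Workbook > Worksheet > Table > Row > Cell > Data); a real download may need cleanup before it parses. The function name `spreadsheetml_rows` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}

def spreadsheetml_rows(xml_text, sheet_index=0):
    """Return the rows of one worksheet as lists of cell strings."""
    root = ET.fromstring(xml_text)
    sheets = root.findall("ss:Worksheet", NS)
    table = sheets[sheet_index].find("ss:Table", NS)
    rows = []
    for row in table.findall("ss:Row", NS):
        values = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index (1-based) means earlier empty cells were omitted
            index = cell.get("{%s}Index" % SS)
            if index is not None:
                while len(values) < int(index) - 1:
                    values.append("")
            data = cell.find("ss:Data", NS)
            values.append("" if data is None else (data.text or ""))
        rows.append(values)
    return rows

# Made-up sample, not the actual iShares file
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Holdings">
    <Table>
      <Row><Cell><Data ss:Type="String">Name</Data></Cell><Cell><Data ss:Type="String">Price</Data></Cell></Row>
      <Row><Cell><Data ss:Type="String">ABC</Data></Cell><Cell ss:Index="2"><Data ss:Type="Number">101.5</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>"""

rows = spreadsheetml_rows(SAMPLE)
print(rows)
```

The resulting list of rows could then be handed to `pd.DataFrame(rows[1:], columns=rows[0])`, with numeric columns converted afterwards.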

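I also tried renaming the file extension with the code below, but this approach does not force a real "save" of the file: it remains the same underlying xml document, just with an xls file extension.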

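import os

# Expand '~' explicitly; os.listdir does not do it
folder = os.path.expanduser('~/models')
for filename in os.listdir(folder):
    if filename.startswith('export') and filename.endswith('.xml'):
        infilename = os.path.join(folder, filename)
        # Swap only the extension; the file contents stay XML
        newname = os.path.splitext(infilename)[0] + '.xls'
        os.rename(infilename, newname)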
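One way to see why renaming has no effect is to look at the first bytes of the file before and after the rename. The sketch below is only an illustration: `sniff_format` and `sniff_file` are hypothetical helpers, and the check relies on the well-known signatures of OLE2 compound files (legacy binary .xls) and zip archives (.xlsx).

```python
import os
import tempfile

OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls
ZIP_MAGIC = b"PK\x03\x04"                          # zip container, e.g. .xlsx
UTF8_BOM = b"\xef\xbb\xbf"

def sniff_format(head):
    """Guess the container format from the leading bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "xls (OLE2 binary)"
    if head.startswith(ZIP_MAGIC):
        return "zip container (e.g. xlsx)"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"

def sniff_file(path):
    with open(path, "rb") as f:
        return sniff_format(f.read(64))

# Write a tiny stand-in XML file, rename it, and sniff it both times
tmpdir = tempfile.mkdtemp()
xml_path = os.path.join(tmpdir, "export.xml")
xls_path = os.path.join(tmpdir, "export.xls")
with open(xml_path, "wb") as f:
    f.write(b'<?xml version="1.0"?><Workbook/>')

before = sniff_file(xml_path)
os.rename(xml_path, xls_path)
after = sniff_file(xls_path)
print(before, after)
```

The rename changes only the directory entry, so both checks report the same format; an actual conversion has to rewrite the bytes.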

https://www.ishares.com/us/258100/fund-download.dl

3 answers:

Answer 0: (score: 0)

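With Excel for Windows, consider having Python make a COM connection to the Excel object library using the win32com module. Specifically, use Excel's Workbooks.OpenXML and SaveAs methods to save the downloaded xml as csv: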

import os
import win32com.client as win32    
import requests as r
import pandas as pd

cd = os.path.dirname(os.path.abspath(__file__))

url = "http://www.ishares.com/us/258100/fund-download.dl"
xmlfile = os.path.join(cd, 'iSharesDownload.xml')
csvfile = os.path.join(cd, 'iSharesDownload.csv')

# DOWNLOAD FILE
try:
    rqpage = r.get(url)
    with open(xmlfile, 'wb') as f:
        f.write(rqpage.content)    
except Exception as e:
    print(e)    
finally:
    rqpage = None

# EXCEL COM TO SAVE EXCEL XML AS CSV
if os.path.exists(csvfile):
    os.remove(csvfile)
try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.OpenXML(xmlfile)
    wb.SaveAs(csvfile, 6)  # 6 = xlCSV file format
    wb.Close(True)    
except Exception as e:
    print(e)    
finally:
    # RELEASES RESOURCES
    wb = None
    excel = None

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000
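The `skiprows=8` value above is tied to the current layout of the export. As a hedged alternative, the header row could be located by searching for a known column name such as `Weight (%)`. The helper `find_header_row` and the sample text below are made up for illustration and use only the standard library.

```python
import csv
import io

def find_header_row(lines, marker):
    """Return the 0-based index of the first row that has `marker` as a cell."""
    for i, row in enumerate(csv.reader(lines)):
        if marker in row:
            return i
    return None

# Made-up sample with a few metadata lines above the real header
SAMPLE_CSV = """Fund name,Example Fund
As of,2017-07-20

Name,Weight (%),Price
ABC Corp,0.16,101.3
XYZ Inc,0.13,102.9
"""

header_at = find_header_row(io.StringIO(SAMPLE_CSV), "Weight (%)")
print(header_at)
```

With a real file, the same function could be called on the open file object, and the result passed as `skiprows` to `pd.read_csv`, assuming no quoted field spans several lines above the header.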

Answer 1: (score: 0)

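With Excel for Mac, consider a VBA solution, since VBA is the most common language for interfacing with the Excel object library. The code below downloads the iShares xml, then uses the OpenXML and SaveAs methods to save it as csv for import into pandas.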

Note: this is untested on Mac, but hopefully the Microsoft.XMLHTTP object is available.

VBA (save in a macro-enabled workbook)

Option Explicit

Sub DownloadXML()
On Error GoTo ErrHandle
    Dim wb As Workbook
    Dim xmlDoc As Object
    Dim xmlfile As String, csvfile As String

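    ' Application.PathSeparator gives the right separator on Windows and Mac
    xmlfile = ActiveWorkbook.Path & Application.PathSeparator & "file.xml"
    csvfile = ActiveWorkbook.Path & Application.PathSeparator & "file.csv"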

    Call DownloadFile("https://www.ishares.com/us/258100/fund-download.dl", xmlfile)

    Set wb = Excel.Workbooks.OpenXML(xmlfile)

    wb.SaveAs csvfile, 6
    wb.Close True

ExitHandle:
    Set wb = Nothing
    Set xmlDoc = Nothing
    Exit Sub

ErrHandle:
    MsgBox Err.Number & " - " & Err.Description, vbCritical
    Resume ExitHandle
End Sub

Function DownloadFile(url As String, filePath As String)
    Dim WinHttpReq As Object, oStream As Object

    Set WinHttpReq = CreateObject("Microsoft.XMLHTTP")
    WinHttpReq.Open "GET", url, False
    WinHttpReq.send

    If WinHttpReq.Status = 200 Then
        Set oStream = CreateObject("ADODB.Stream")
        oStream.Open
        oStream.Type = 1
        oStream.Write WinHttpReq.responseBody
        oStream.SaveToFile filePath, 2 ' 1 = no overwrite, 2 = overwrite
        oStream.Close
    End If

    Set WinHttpReq = Nothing
    Set oStream = Nothing
End Function

Python

import pandas as pd

csvfile = "/path/to/file.csv"

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000

Answer 2: (score: 0)

I was able to bypass the web scraping after discovering that the site I was working with had developed an API. I then used Python's requests module.

import requests

url = "https://www.blackrock.com/tools/hackathon/performance"
# `tickers` is assumed to be a list of ticker symbols defined elsewhere
for ticker in tickers:
    params = {'identifiers': ticker,
              'returnsType': 'MONTHLY'}
    response = requests.get(url, params=params)
    data = response.json()
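To make the request above easier to inspect, the sketch below builds the query string that the `params` dict corresponds to, using only the standard library. The ticker `'AGG'` is a hypothetical placeholder, and the parameter names are simply those shown in the answer; nothing here is checked against the API itself.

```python
from urllib.parse import urlencode

url = "https://www.blackrock.com/tools/hackathon/performance"
# 'AGG' is a made-up example identifier, not taken from the answer
params = {'identifiers': 'AGG', 'returnsType': 'MONTHLY'}

# requests.get(url, params=params) appends the encoded parameters the same way
full_url = url + "?" + urlencode(params)
print(full_url)
```

Pasting the printed URL into a browser is a quick way to look at the raw JSON before writing any parsing code.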