Question

I have this code, which downloads this fund data in Excel 2004 XML format:

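import urllib.request

url = 'https://www.ishares.com/us/258100/fund-download.dl'
# Download the raw bytes and write them unchanged (binary mode avoids newline/encoding changes)
with urllib.request.urlopen(url) as s:
    contents = s.read()
with open("export.xml", 'wb') as f:
    f.write(contents)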

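My goal is to programmatically convert this file to .xls so that I can then read it into a pandas DataFrame. I know I could parse the file with Python's xml libraries. However, I noticed that if I open the xml file and manually save it with an xls file extension, pandas can read it and I get the result I want.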
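A minimal sketch of the XML-parsing route mentioned above, assuming the download is a SpreadsheetML workbook (the `urn:schemas-microsoft-com:office:spreadsheet` namespace with Worksheet/Table/Row/Cell/Data elements). The `SAMPLE` document and the `rows_from_spreadsheetml` helper are hypothetical stand-ins for illustration, not the actual iShares file or an existing library function.

```python
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by Excel XML Spreadsheet workbooks
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Tiny made-up document; the real download has more rows and metadata
SAMPLE = """<?xml version="1.0"?>
<ss:Workbook xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <ss:Worksheet ss:Name="Holdings">
    <ss:Table>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Name</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="String">Price</ss:Data></ss:Cell>
      </ss:Row>
      <ss:Row>
        <ss:Cell><ss:Data ss:Type="String">Bond A</ss:Data></ss:Cell>
        <ss:Cell><ss:Data ss:Type="Number">101.5</ss:Data></ss:Cell>
      </ss:Row>
    </ss:Table>
  </ss:Worksheet>
</ss:Workbook>"""


def rows_from_spreadsheetml(xml_text, sheet_index=0):
    """Return one worksheet as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text)
    sheet = root.findall("ss:Worksheet", NS)[sheet_index]
    rows = []
    for row in sheet.iterfind("ss:Table/ss:Row", NS):
        values = []
        for cell in row.iterfind("ss:Cell", NS):
            data = cell.find("ss:Data", NS)
            # Empty cells have no Data child
            values.append(data.text if data is not None else None)
        rows.append(values)
    return rows


rows = rows_from_spreadsheetml(SAMPLE)
print(rows)
```

The resulting rows could then be handed to pandas, for example `pd.DataFrame(rows[1:], columns=rows[0])`. This sketch keeps every value as a string and ignores the `ss:Index` attribute that SpreadsheetML uses for skipped cells, so sparse rows would need extra handling.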

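I have also tried renaming the file extension with the following code, but this approach does not force a "save as" on the file: it remains the underlying xml document, just with an xls file extension.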

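import os

folder = os.path.expanduser('~/models')  # '~' is not expanded automatically
for filename in os.listdir(folder):
    if filename.startswith('export'):
        # Swap the extension for .xls; this only renames, the content stays XML
        newname = os.path.splitext(filename)[0] + '.xls'
        os.rename(os.path.join(folder, filename), os.path.join(folder, newname))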
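One way to confirm that a rename leaves the content untouched is to look at the first bytes of the file. The sketch below assumes a binary .xls workbook is an OLE2 compound file (which begins with the signature `D0 CF 11 E0 A1 B1 1A E1`) and that the XML export begins with an `<?xml` declaration, possibly after a UTF-8 byte order mark. The names `sniff_format` and `sniff_file` are hypothetical helpers.

```python
# Signature at the start of every OLE2 compound file (binary .xls)
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_format(head):
    """Guess the container format from the first bytes of a file."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    # Skip an optional UTF-8 BOM and leading whitespace before the declaration
    if head.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path):
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return sniff_format(f.read(64))


print(sniff_format(b'<?xml version="1.0"?><Workbook/>'))
print(sniff_format(OLE2_MAGIC + b"\x00" * 8))
```

Calling `sniff_file` on a renamed export would be expected to still report `xml`, since `os.rename` never rewrites the bytes on disk.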


Answer 1

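With Excel for Windows, consider having Python make a COM connection to the Excel object library using the win32com module. Specifically, use Excel's Workbooks.OpenXML and SaveAs methods to save the downloaded xml as csv: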

import os
import win32com.client as win32    
import requests as r
import pandas as pd

cd = os.path.dirname(os.path.abspath(__file__))

url = "http://www.ishares.com/us/258100/fund-download.dl"
xmlfile = os.path.join(cd, 'iSharesDownload.xml')
csvfile = os.path.join(cd, 'iSharesDownload.csv')

# DOWNLOAD FILE
try:
    rqpage = r.get(url)
    with open(xmlfile, 'wb') as f:
        f.write(rqpage.content)    
except Exception as e:
    print(e)    
finally:
    rqpage = None

# EXCEL COM TO SAVE EXCEL XML AS CSV
if os.path.exists(csvfile):
    os.remove(csvfile)
try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.OpenXML(xmlfile)
    wb.SaveAs(csvfile, 6)  # 6 = xlCSV file format
    wb.Close(True)    
except Exception as e:
    print(e)    
finally:
    # RELEASES RESOURCES
    wb = None
    excel = None

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000

Answer 2

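With Excel for Mac, consider a VBA solution, since VBA is the most common language for interfacing with the Excel object library. The code below downloads the iShares xml and then saves it as csv, using the OpenXML and SaveAs methods, for the pandas import.

Note: this is untested on Mac, but hopefully the Microsoft.XMLHTTP object is available.

VBA (save in a macro-enabled workbook)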

Option Explicit

Sub DownloadXML()
On Error GoTo ErrHandle
    Dim wb As Workbook
    Dim xmlDoc As Object
    Dim xmlfile As String, csvfile As String

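    ' Application.PathSeparator keeps the path valid on both Windows and Mac
    xmlfile = ActiveWorkbook.Path & Application.PathSeparator & "file.xml"
    csvfile = ActiveWorkbook.Path & Application.PathSeparator & "file.csv"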

    Call DownloadFile("https://www.ishares.com/us/258100/fund-download.dl", xmlfile)

    Set wb = Excel.Workbooks.OpenXML(xmlfile)

    wb.SaveAs csvfile, 6
    wb.Close True

ExitHandle:
    Set wb = Nothing
    Set xmlDoc = Nothing
    Exit Sub

ErrHandle:
    MsgBox Err.Number & " - " & Err.Description, vbCritical
    Resume ExitHandle
End Sub

Function DownloadFile(url As String, filePath As String)
    Dim WinHttpReq As Object, oStream As Object

    Set WinHttpReq = CreateObject("Microsoft.XMLHTTP")
    WinHttpReq.Open "GET", url, False
    WinHttpReq.send

    If WinHttpReq.Status = 200 Then
        Set oStream = CreateObject("ADODB.Stream")
        oStream.Open
        oStream.Type = 1
        oStream.Write WinHttpReq.responseBody
        oStream.SaveToFile filePath, 2 ' 1 = no overwrite, 2 = overwrite
        oStream.Close
    End If

    Set WinHttpReq = Nothing
    Set oStream = Nothing
End Function

Python

import pandas as pd

csvfile = "/path/to/file.csv"

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000

Answer 3

I was able to bypass web scraping by discovering that the site I was working with had developed an API, and then using Python's requests module.

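import requests

url = "https://www.blackrock.com/tools/hackathon/performance"
# `tickers` is assumed to be a list of fund identifiers defined elsewhere
for ticker in tickers:
    params = {'identifiers': ticker,
              'returnsType': 'MONTHLY'}
    response = requests.get(url, params=params)
    data = response.json()  # parsed JSON body for this ticker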
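To see what each pass of that loop asks the API for, the query string built from the `params` dict can be reproduced with the standard library. This is only a sketch: the tickers `AAA` and `BBB` are made-up placeholders, `build_request_url` is a hypothetical helper, and `urlencode` is used here to approximate the encoding that requests applies for simple values like these.

```python
from urllib.parse import urlencode

url = "https://www.blackrock.com/tools/hackathon/performance"
tickers = ["AAA", "BBB"]  # hypothetical placeholders, not real identifiers


def build_request_url(base, ticker, returns_type="MONTHLY"):
    """Build the GET URL for one ticker without sending any request."""
    query = urlencode({"identifiers": ticker, "returnsType": returns_type})
    return f"{base}?{query}"


for ticker in tickers:
    print(build_request_url(url, ticker))
```

Printing the URLs first makes it easy to paste one into a browser and inspect the JSON structure before deciding how to load it into a DataFrame.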

Force an xml file to be saved in xls format in Python
