Getting an xlsx file from an .aspx web page in Python

Time: 2018-02-03 07:23:39

Tags: python pandas

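I'm trying to read an .xlsx file from the following URL, but pd.read_excel raises an error even though the file downloads successfully from the browser.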

http://members.tsetmc.com/tsev2/excel/MarketWatchPlus.aspx?d=0

import numpy as np
import pandas as pd
data=pd.read_excel("http://members.tsetmc.com/tsev2/excel/MarketWatchPlus.aspx?d=0")

The traceback is:

>>> data=pd.read_excel("http://members.tsetmc.com/tsev2/excel/MarketWatchPlus.aspx?d=0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
[...]
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\x1f\x8b\x08\x00}\xceuZ'

1 answer:

Answer 0 (score: 1)

The first four bytes shown, \x1f\x8b\x08\x00, make it clear that we're receiving a gzipped file, which pandas isn't automatically decompressing. We can do it ourselves, though:

In [53]: url = "http://members.tsetmc.com/tsev2/excel/MarketWatchPlus.aspx?d=0"

In [54]: import urllib.request, gzip

In [55]: df = pd.read_excel(gzip.GzipFile(fileobj=urllib.request.urlopen(url)))

In [56]: df.iloc[:5, :5]
Out[56]: 
  دیده بان بازار : 1396/11/14 - زمان آخرین معامله : 14:42:03  \
0                                               نماد           
1                                                فسا           
2                                              فرآور           
3                                              فملي2           
4                                              وبملت           

                Unnamed: 1 Unnamed: 2 Unnamed: 3    Unnamed: 4  
0                      نام      تعداد        حجم          ارزش  
1             پتروشيمي فسا        512    5251556    3647287532  
2  فرآوري‌موادمعدني‌ايران‌        310     694381   11249763313  
3   ملي‌ صنايع‌ مس‌ ايران‌          1   40949671  115887568930  
4                 بانك ملت        350    6350364    6561761997  

The remaining tweaks to get the columns set appropriately aren't related to the problem in the post and so I'll leave those alone.
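
The byte check described above can also be done explicitly before handing the data to pandas. The following is a minimal sketch using only the standard library; the helper names `maybe_gunzip` and `as_excel_buffer` are hypothetical and not part of pandas or of the original answer. It assumes the whole response body has already been read into memory as bytes.

```python
import gzip
import io

# Every gzip stream starts with these two magic bytes (the traceback shows
# them as b'\x1f\x8b', followed by \x08 for the deflate method).
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw: bytes) -> bytes:
    """Decompress raw if it looks like gzip data; otherwise return it unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def as_excel_buffer(raw: bytes) -> io.BytesIO:
    """Wrap the (possibly decompressed) payload in a seekable in-memory buffer."""
    return io.BytesIO(maybe_gunzip(raw))
```

With the URL from the question, this could be used as `pd.read_excel(as_excel_buffer(urllib.request.urlopen(url).read()))`. That call needs network access and an installed Excel engine, so it is left out of the block itself. Reading into a `BytesIO` gives pandas a seekable file object, which may be more robust than passing the streaming response directly.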