Pandas.read_excel KeyError when reading a set of xlsx files

Time: 2017-02-26 12:27:41

Tags: python excel pandas

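I use the Anaconda shell for data processing. I am loading a batch of Excel files (25 files) into pandas. On the files in https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 I get an error. I can't find the cause or how to fix it.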

import pandas as pd
import numpy as np
import os

os.chdir(r"C:\Users\Twentyouts\Desktop\Newdata" )
path = os.getcwd()

files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

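df = pd.DataFrame()   # accumulates the rows of every workbook
for f in files_xlsx:
    print(f)
    loading = pd.read_excel(f, header=0)   # first row holds the column names
    df = pd.concat([df, loading])

Files printed before the KeyError is raised: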
2016-06-20–2016-06-26.xlsx
2016-06-27–2016-07-03.xlsx
2016-07-04–2016-07-10.xlsx
2016-07-11–2016-07-17.xlsx
2016-08-01–2016-08-07.xlsx
2016-08-15–2016-08-21.xlsx

3 answers:

Answer 0 (score: 2)

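Indeed, as @MaxU points out, the Excel files are malformed, but interestingly they parse once they are properly saved as .xlsx files. Possibly someone tried to upgrade the invalid files from the older .xls version simply by changing the extension to .xlsx. Neither format is a plain text file whose extension can be changed without risk; they are two very different file formats.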
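A quick way to test the renamed-extension hypothesis is to look at a file's first bytes rather than its name. The sketch below is only an illustration: `sniff_excel_format` is a hypothetical helper name, and it relies on the facts that an .xlsx workbook is a ZIP archive (starting with `PK\x03\x04`) while a legacy .xls workbook is an OLE2 compound file. A result of 'xlsx' only means "ZIP container"; it does not prove the workbook inside is valid.

```python
# Leading bytes that identify the container format (independent of the extension).
ZIP_MAGIC = b"PK\x03\x04"                         # .xlsx is a ZIP archive of XML parts
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls is an OLE2 compound file


def sniff_excel_format(path):
    """Return 'xlsx', 'xls', or 'unknown' based on the file's first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head.startswith(OLE2_MAGIC):
        return "xls"
    return "unknown"
```

Running it over every path in the folder and printing the result next to the file name would show whether any ".xlsx" file is really an .xls file in disguise.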

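Consider using the win32com module to drive Excel's COM interface and call the Workbook.SaveAs method, which saves the malformed files as genuine OpenXML workbooks. Note: this solution only works with Python on Windows with MS Excel installed.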

import pandas as pd
import glob
import win32com.client as win32

xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx")

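def save_xlsx(srcfile):
    # Re-save a malformed workbook as a real OpenXML file through Excel itself
    newfile = srcfile.replace('.xlsx', '_new.xlsx')
    xlApp = wb = None
    try:
        xlApp = win32.gencache.EnsureDispatch('Excel.Application')
        wb = xlApp.Workbooks.Open(srcfile)
        wb.SaveAs(newfile, 51)                 # 51 = xlOpenXMLWorkbook (.xlsx)
        print('Malformed file saved as {}'.format(newfile))

    except Exception as e:
        print(e)
    finally:
        # Release Excel even if opening or saving failed
        if wb is not None:
            wb.Close(True)
        if xlApp is not None:
            xlApp.Quit()
    return newfile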

def xl_read():    
    dfs = []
    for f in xlsxfiles:        
        try:
            df = pd.read_excel(f)
        except Exception as e:            
            df = pd.read_excel(save_xlsx(f))

        print('File: {}, Shape: {}'.format(f, df.shape))
        dfs.append(df)            
    return pd.concat(dfs)

print('Final dataframe shape: {}'.format(xl_read().shape))  
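One thing to watch with this approach: the repaired copies are written next to the originals with a `_new.xlsx` suffix, so a later run of the same `*.xlsx` glob would match them as well and their rows would be read a second time. A minimal sketch of a filter, assuming that suffix and also skipping the `~$` owner files Excel creates while a workbook is open; `source_workbooks` is a hypothetical helper name, not part of the answer's code.

```python
def source_workbooks(paths, repaired_suffix="_new.xlsx"):
    """Drop repaired copies and Excel's '~$' owner files from a list of paths."""
    keep = []
    for p in paths:
        # Take the file name, accepting both Windows and POSIX separators
        name = p.replace("\\", "/").rsplit("/", 1)[-1]
        if name.endswith(repaired_suffix):   # copy written by save_xlsx
            continue
        if name.startswith("~$"):            # owner file of a workbook open in Excel
            continue
        keep.append(p)
    return keep
```

It could be applied right after the glob call, for example `xlsxfiles = source_workbooks(glob.glob(...))`.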

Output (the final dataframe has 330,257 rows and 30 columns)

File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27)
File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27)
File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27)
File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28)
File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28)
Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx
File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27)
File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27)
File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx
File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx
File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx
File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27)
File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27)
File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27)
File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx
File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx
File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx
File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx
File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx
File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx
File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx
File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx
File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx
File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx
File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx
File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx
File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx
File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx
File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27)
File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27)
File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27)
Final dataframe shape: (330257, 30)

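You could even consider a database-engine approach: use Windows' ACE Engine through pyodbc to query each workbook with pandas read_sql, since every workbook shares the same worksheet name, TDSheet.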

#...same as above
import pyodbc

# ODBC driver clause; the braces are part of the connection string
DRIVER = 'Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'

def sql_read():
    dfs = []
    for f in xlsxfiles:
        try:
            # f is already a full path returned by glob
            conn = pyodbc.connect(DRIVER + 'DBQ={};'.format(f), autocommit=True)
            df = pd.read_sql('SELECT * FROM [TDSheet$];', conn)
            conn.close()

        except Exception as e:
            # Fall back to a copy re-saved by Excel
            conn = pyodbc.connect(DRIVER + 'DBQ={};'.format(save_xlsx(f)), autocommit=True)
            df = pd.read_sql('SELECT * FROM [TDSheet$];', conn)
            conn.close()

        print('File: {}, Shape: {}'.format(f, df.shape))
        dfs.append(df)
    return pd.concat(dfs)

Answer 1 (score: 1)

It looks like some of your Excel files are malformed:

import os
import glob
import pandas as pd

excel_files_mask = r'D:\temp\.data\42468475\*.xlsx'

files = glob.glob(excel_files_mask)

def merge_excel_files(files, **kwargs):
    #return pd.concat([pd.read_excel(f, **kwargs) for f in files],
    #                 ignore_index=True)
    dfs = []
    for f in files:
        #print('processing: [{}]'.format(f))
        try:
            df = pd.read_excel(f, **kwargs)
            dfs.append(df)
            print('parsed: [{}], shape: {}'.format(f, df.shape))
        except KeyError:
            print("ERROR: file [{}] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...".format(f))
    return pd.concat(dfs, ignore_index=True)

df = merge_excel_files(files, header=None, skiprows=1)
print(df.shape)

Yields:

parsed: [D:\temp\.data\42468475\2016-06-20–2016-06-26.xlsx], shape: (5912, 27)
parsed: [D:\temp\.data\42468475\2016-06-27–2016-07-03.xlsx], shape: (5362, 27)
parsed: [D:\temp\.data\42468475\2016-07-04–2016-07-10.xlsx], shape: (5387, 27)
parsed: [D:\temp\.data\42468475\2016-07-11–2016-07-17.xlsx], shape: (5331, 28)
parsed: [D:\temp\.data\42468475\2016-08-01–2016-08-07.xlsx], shape: (4965, 28)
ERROR: file [D:\temp\.data\42468475\2016-08-15–2016-08-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\2016-08-22–2016-08-28.xlsx], shape: (5179, 27)
parsed: [D:\temp\.data\42468475\2016-08-29–2016-09-04.xlsx], shape: (5855, 27)
ERROR: file [D:\temp\.data\42468475\2016-09-05–2016-09-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-09-12–2016-09-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-09-19–2016-09-25.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\2016-09-26–2016-10-02.xlsx], shape: (7018, 27)
parsed: [D:\temp\.data\42468475\2016-09.xlsx], shape: (23874, 27)
parsed: [D:\temp\.data\42468475\2016-10-03–2016-10-09.xlsx], shape: (6587, 27)
ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-10–2016-10-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-17–2016-10-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-24–2016-10-30.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10-31–2016-11-06.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-10.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-07–2016-11-13.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-14–2016-11-20.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-21.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-21–2016-11-23.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11-28–2016-12-04.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-05–2016-12-11.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-12–2016-12-18.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-16–2016-12-22.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2016-12-26–2016-12-31.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-01–2017-01-08.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-09–2017-01-15.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-23–2017-01-29.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-01-30–2017-02-05.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\2017-02-13–2017-02-19.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\test.xlsx], shape: (5315, 27)
ERROR: file [D:\temp\.data\42468475\Выгрузка 12-15.12.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
ERROR: file [D:\temp\.data\42468475\Выгрузка 21-27.xlsx] couldn't be parsed! Open it in Excel and save it as (.xlsx) file ...
parsed: [D:\temp\.data\42468475\Выгрузка 26-29.12.xlsx], shape: (4539, 27)
(85324, 28)
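To see why a particular workbook is rejected, the package can be inspected without Excel, because an .xlsx file is a ZIP archive of XML parts. The xlrd traceback quoted in another answer on this page fails while looking up a sheet's relationship id, so the sketch below lists the sheets in `xl/workbook.xml` whose `r:id` is empty or not declared in `xl/_rels/workbook.xml.rels`. This is a diagnostic guess, not a confirmed description of these files: `unresolved_sheets` is a hypothetical helper, and it assumes the usual (transitional) OOXML namespaces.

```python
import zipfile
import xml.etree.ElementTree as ET

NS_MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
NS_PKG = "http://schemas.openxmlformats.org/package/2006/relationships"


def unresolved_sheets(xlsx):
    """Return names of sheets whose r:id is missing from workbook.xml.rels.

    `xlsx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx) as zf:
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        try:
            rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
            known_ids = {r.get("Id") for r in rels.iter("{%s}Relationship" % NS_PKG)}
        except KeyError:          # the relationships part itself is absent
            known_ids = set()
    bad = []
    for sheet in workbook.iter("{%s}sheet" % NS_MAIN):
        rid = sheet.get("{%s}id" % NS_REL)
        if not rid or rid not in known_ids:
            bad.append(sheet.get("name"))
    return bad
```

An empty list means every sheet resolves; a file that is not a ZIP archive at all raises `zipfile.BadZipFile` instead, which points to a different kind of damage.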

Answer 2 (score: 0)

I had the same error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 867, in __init__
    self._reader = self._engines[engine](self._io)
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_xlrd.py", line 22, in __init__
    super().__init__(filepath_or_buffer)
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 353, in __init__
    self.book = self.load_workbook(filepath_or_buffer)
  File "/home/tassos/.local/lib/python3.6/site-packages/pandas/io/excel/_xlrd.py", line 37, in load_workbook
    return open_workbook(filepath_or_buffer)
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/__init__.py", line 138, in open_workbook
    ragged_rows=ragged_rows,
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/xlsx.py", line 812, in open_workbook_2007_xml
    x12book.process_stream(zflo, 'Workbook')
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/xlsx.py", line 271, in process_stream
    meth(self, elem)
  File "/home/tassos/.local/lib/python3.6/site-packages/xlrd/xlsx.py", line 380, in do_sheet
    reltype = self.relid2reltype[rid]
KeyError: ''

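I found that the problem was that I had the file open in a spreadsheet program (LibreOffice Calc) at the same time. As long as the file is open in another program,

pd.read_excel("file.xlsx", "Sheet1")

will not run. This is another case in which an Excel file is reported as malformed! Hope this helps!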
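If that is the cause, it can be checked before reading. LibreOffice keeps a hidden lock file named `.~lock.<filename>#` next to any document it has open, so a minimal sketch is to look for that file first. The helper names here are hypothetical, and a stale lock file left behind by a crash would give a false positive.

```python
import os


def libreoffice_lock_path(path):
    """Path of the lock file LibreOffice keeps next to an open document."""
    folder, name = os.path.split(os.path.abspath(path))
    return os.path.join(folder, ".~lock.{}#".format(name))


def is_open_in_libreoffice(path):
    """True if a LibreOffice lock file exists for the given document."""
    return os.path.exists(libreoffice_lock_path(path))
```

For example, calling `is_open_in_libreoffice("file.xlsx")` before `pd.read_excel` lets the script print a clear "close the file first" message instead of failing with a KeyError.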