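I'm getting a KeyError when I try to set the index of a DataFrame. I've set indexes the same way before without running into this, so I'd like to know what is going wrong. The data has no column headers, so the DataFrame columns are labelled 0, 1, 2, and so on. The error occurs whichever column label I use.
I get KeyError: '0' when I try to use the first column (which I want to use as the unique index).
Context: in the example below I pick up macro-enabled Excel spreadsheets, compact the data, read it and convert it into a DataFrame.
I then want to add the file name as a column, set the index and strip the whitespace so that I can use the index labels to extract the data I need. Not every worksheet has those index labels, so I use try/except to skip the worksheets whose index does not contain them. Finally I want to concatenate every result into one DataFrame and drop the unused columns.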
import itertools
import glob
from openpyxl import load_workbook
from pandas import DataFrame
import pandas as pd
import os

def get_data(ws):
    for row in ws.values:
        row_it = iter(row)
        for cell in row_it:
            if cell is not None:
                yield itertools.chain((cell,), row_it)
                break

def read_workbook(file_):
    wb = load_workbook(file_, data_only=True)
    for sheet in wb.worksheets:
        ws = sheet
    return DataFrame(get_data(ws))

path =r'dir'
allFiles = glob.glob(path + "/*.xlsm")
frame = pd.DataFrame()
list_ = []

for file_ in allFiles:
    parsed_file = read_workbook(file_)
    parsed_file['filename'] = os.path.basename(file_)
    parsed_file.set_index(['0'], inplace = True)
    parsed_file.index.str.strip()
    try:
        parsed_file.loc["Staff" : "Total"].copy()
        list_.append(parsed_file)
    except KeyError:
        pass

frame = pd.concat(list_)
print(frame.dropna(axis='columns', thresh=2, inplace = True))
Example DataFrame, showing the desired index position and the labels to extract between.
   index
   0       1  2
0  5       2  4
1  RTJHD   5  9
2  ABCD    4  6
3  Staff   9  3   --- extract from here
4  FHDHSK  3  2
5  IRRJWK  7  1
6  FJDDCN  1  8
7  67      4  7
8  Total   5  3   --- to here
Error
Traceback (most recent call last):
File "<ipython-input-29-d8fd24ca84ec>", line 1, in <module>
runfile('dir.py', wdir='C:/dir/Documents')
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
execfile(filename, namespace)
File "C:\ProgramData\Anaconda2\lib\site-packages\spyder\utils\site\sitecustomize.py", line 87, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "dir.py", line 36, in <module>
parsed_file.set_index(['0'], inplace = True)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 2830, in set_index
level = frame[col]._values
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1964, in __getitem__
return self._getitem_column(key)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1971, in _getitem_column
return self._get_item_cache(key)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\generic.py", line 1645, in _get_item_cache
values = self._data.get(item)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\internals.py", line 3590, in get
loc = self.items.get_loc(item)
File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\indexes\base.py", line 2444, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5280)
File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)
File "pandas\_libs\hashtable_class_helper.pxi", line 1210, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20523)
File "pandas\_libs\hashtable_class_helper.pxi", line 1218, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20477)
KeyError: '0'
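Answer 0 (score: 1)
You get this error because your DataFrame is read in without any header. That means the column labels are integers, held in an Int64Index, so the string '0' does not match any of them: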
Int64Index([0, 1, 2, 3, ...], dtype='int64')
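To make the mismatch concrete, here is a minimal sketch that reproduces the problem. The rows are hypothetical stand-ins for the worksheet data, and the exact index class that pandas reports for the columns may differ between versions; the point is only that the labels are integers, so a string lookup fails.

```python
import pandas as pd

# Hypothetical rows standing in for the headerless worksheet data.
rows = [("Staff", 9, 3), ("FHDHSK", 3, 2), ("Total", 5, 3)]
df = pd.DataFrame(rows)

# Without a header, pandas labels the columns with integers.
labels = list(df.columns)

has_int_zero = 0 in df.columns    # the integer label exists
has_str_zero = "0" in df.columns  # the string label does not

# Asking for the string label therefore raises KeyError.
try:
    df.set_index(["0"])
    raised = False
except KeyError:
    raised = True

print(labels, has_int_zero, has_str_zero, raised)
```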
At this point, I would suggest simply accessing df.columns by position whenever you have to deal with the columns:
parsed_file.set_index(parsed_file.columns[0], inplace = True)
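That way you are not hard-coding column names when you access columns by position. Another option is to assign column names of your own and then refer to those.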
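Both options can be sketched side by side. This is only an illustration under assumptions: the rows are made up, and the column names "label", "a" and "b" are hypothetical placeholders, not names from the original spreadsheets.

```python
import pandas as pd

# Hypothetical rows standing in for the headerless worksheet data.
rows = [("Staff", 9, 3), ("FHDHSK", 3, 2), ("Total", 5, 3)]

# Option 1: refer to the first column by position, so no label is hard-coded.
by_position = pd.DataFrame(rows)
by_position = by_position.set_index(by_position.columns[0])

# Option 2: assign column names of your own (hypothetical names here)
# and refer to those instead.
named = pd.DataFrame(rows, columns=["label", "a", "b"])
named = named.set_index("label")

print(by_position)
print(named)
```

Either way the first column ends up as the index, so label-based lookups such as `named.loc["Staff", "a"]` become possible.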