Question

I have a pandas DataFrame that uses a MultiIndex for both its columns and its rows. Here is a simplified example:

import pandas as pd
import datetime

col1 = [datetime.date(2018, 1, 1)+i*datetime.timedelta(days=1) for i in range(4) for j in range(2)]
col2 = [     'lunch',     'dinner',      'lunch',     'dinner',
             'lunch',     'dinner',      'lunch',     'dinner']

hdr1 = ['starter', 'starter', 'main', 'main', 'main', 'dessert', 'dessert']
hdr2 = [        0,         1,      0,      1,       2,         0,         1]

col_index = pd.MultiIndex.from_arrays([col1, col2], names=['date', 'meal'])
row_index = pd.MultiIndex.from_arrays([hdr1, hdr2], names=['dish', 'content'])

df = pd.DataFrame(index=col_index, columns=row_index)

If I try to print a specific cell, it works fine:

In [45]: df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
Out[45]: 
date        meal 
2018-01-02  lunch    NaN
Name: (starter, 0), dtype: object

However, if I save it to a CSV file and read it back:

# write to CSV
df.to_csv('test.csv')

# read from CSV
df = pd.read_csv('test.csv', index_col=[0, 1], header=[0, 1], parse_dates=True)

The same command now gives this result:

In [47]: df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-d0a7081dd150> in <module>()
----> 1 df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
[...]

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('starter', 0)

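The reason is that when the DataFrame is read back from the CSV file, the second level of the column index no longer holds integers but strings: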

In [55]: df.columns.get_level_values(1)
Out[55]: Index(['0', '1', '0', '1', '2', '0', '1'], dtype='object', name='content')

How can I force the second level of the column index to be read as integers rather than strings?

Answer 1

I'm afraid the only option is to bite the bullet and fix the headers after reading, using pd.MultiIndex.set_levels.

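# set_levels expects the unique level values, so convert levels[1]
# rather than the per-column labels from get_level_values(1)
df.columns = df.columns.set_levels(
      df.columns.levels[1].astype(int), level=1
)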

df.loc[('2018-01-02', 'lunch'), ('starter', 0)]

date        meal 
2018-01-02  lunch    NaN
Name: (starter, 0), dtype: object
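
For anyone who wants to try this without touching the disk, here is a minimal sketch of the round trip and the set_levels fix. It assumes a smaller frame than the one in the question, filled with the placeholder value 1.0, and uses an in-memory buffer in place of test.csv; those choices are for illustration only.

```python
import io
import pandas as pd

# Small frame with integer labels on the second column level.
columns = pd.MultiIndex.from_arrays(
    [["starter", "starter", "main", "main", "main"], [0, 1, 0, 1, 2]],
    names=["dish", "content"],
)
index = pd.MultiIndex.from_arrays(
    [["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"],
     ["lunch", "dinner", "lunch", "dinner"]],
    names=["date", "meal"],
)
original = pd.DataFrame(1.0, index=index, columns=columns)

# Round trip through CSV text held in memory instead of a file on disk.
buffer = io.StringIO(original.to_csv())
restored = pd.read_csv(buffer, index_col=[0, 1], header=[0, 1])

# Straight after the read, the second column level holds strings.
labels_after_read = list(restored.columns.get_level_values(1))

# Convert only the unique level values; the mapping from each column
# to its level value is left as it was.
restored.columns = restored.columns.set_levels(
    restored.columns.levels[1].astype(int), level=1
)
labels_after_fix = list(restored.columns.get_level_values(1))

# The integer key from the question can be used again.
value = restored.loc[("2018-01-02", "lunch"), ("starter", 0)]
```

Converting `levels[1]` rather than the full list of column labels keeps the values passed to set_levels unique, which is what a MultiIndex level requires.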

Answer 2

Try the dtype= option of pd.read_csv: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html This lets you explicitly set the type of each column on import.
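
Before relying on dtype= for this, it may be worth checking what it changes. The sketch below assumes a hypothetical two-column CSV held in memory whose header labels look like integers. In this small case dtype= sets the type of the values in each column, while the header labels still come back as strings, so the column level may still need converting after the read.

```python
import io
import pandas as pd

# Hypothetical CSV whose header labels look like integers.
csv_text = "0,1\n10,20\n30,40\n"

# dtype= is applied to the parsed values of each column.
frame = pd.read_csv(io.StringIO(csv_text), dtype=float)

value_dtypes = [str(t) for t in frame.dtypes]   # types of the data values
header_labels = list(frame.columns)             # labels taken from the header row
```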

pandas MultiIndex column headers change type when reading a CSV file
