Suppose I have a DataFrame like the one below, and I want to convert it into a DataFrame with a 2-level MultiIndex.
         dt         st  close  volume
0  20100101  000001.sz      1   10000
1  20100101  000002.sz     10   50000
2  20100101  000003.sz      5    1000
3  20100101  000004.sz     15    7000
4  20100101  000005.sz    100  100000
5  20100102  000001.sz      2   20000
6  20100102  000002.sz     20   60000
7  20100102  000003.sz      6    2000
8  20100102  000004.sz     20    8000
9  20100102  000005.sz    110  110000
But when I try this code:
import pandas as pd

data = pd.read_csv('data/trial.csv')
print(data)
# Build every (dt, st) combination as a 2-level index
idx = pd.MultiIndex.from_product([data.dt.unique(),
                                  data.st.unique()],
                                 names=['dt', 'st'])
col = ['close', 'volume']
df = pd.DataFrame(data, idx, col)
print(df)
I find that all the elements are NaN:
                    close  volume
dt       st
20100101 000001.sz    NaN     NaN
         000002.sz    NaN     NaN
         000003.sz    NaN     NaN
         000004.sz    NaN     NaN
         000005.sz    NaN     NaN
20100102 000001.sz    NaN     NaN
         000002.sz    NaN     NaN
         000003.sz    NaN     NaN
         000004.sz    NaN     NaN
         000005.sz    NaN     NaN
How should I handle this? Thanks.
Answer 0 (score: 3)
Try using set_index() on the DataFrame you read from the CSV, as follows:
new_df = data.set_index(['dt', 'st'])
Result:
>>> new_df
                    close  volume
dt       st
20100101 000001.sz      1   10000
         000002.sz     10   50000
         000003.sz      5    1000
         000004.sz     15    7000
         000005.sz    100  100000
20100102 000001.sz      2   20000
         000002.sz     20   60000
         000003.sz      6    2000
         000004.sz     20    8000
         000005.sz    110  110000
>>> new_df.index
MultiIndex(levels=[[20100101, 20100102], ['000001.sz', '000002.sz', '000003.sz', '000004.sz', '000005.sz']],
labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=['dt', 'st'])
Answer 1 (score: 3)
You only need the index_col parameter of read_csv:
#by positions of columns
data = pd.read_csv('data/trial.csv', index_col=[0,1])
Or:
#by names of columns
data = pd.read_csv('data/trial.csv', index_col=['dt', 'st'])
print (data)
                    close  volume
dt       st
20100101 000001.sz      1   10000
         000002.sz     10   50000
         000003.sz      5    1000
         000004.sz     15    7000
         000005.sz    100  100000
20100102 000001.sz      2   20000
         000002.sz     20   60000
         000003.sz      6    2000
         000004.sz     20    8000
         000005.sz    110  110000
Why are all the elements NaN when constructing the MultiIndex DataFrame?
The reason lies in the DataFrame constructor:
df = pd.DataFrame(data, idx, col)
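The DataFrame named data has a RangeIndex (0 to 9). When a DataFrame is passed to the constructor together with a new index, pandas aligns the rows by label rather than by position. None of the RangeIndex labels match the labels of the new MultiIndex, so every row comes back as NaN.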
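The alignment behaviour described above can be reproduced with a small in-memory frame. This is only a sketch: the four-row frame below is a hypothetical stand-in for data/trial.csv, and the names misaligned, all_nan and fixed are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question (two dates, two symbols).
data = pd.DataFrame({
    "dt": [20100101, 20100101, 20100102, 20100102],
    "st": ["000001.sz", "000002.sz", "000001.sz", "000002.sz"],
    "close": [1, 10, 2, 20],
    "volume": [10000, 50000, 20000, 60000],
})

idx = pd.MultiIndex.from_product(
    [data["dt"].unique(), data["st"].unique()], names=["dt", "st"]
)
col = ["close", "volume"]

# Passing a DataFrame plus a new index reindexes by label. The labels
# 0..3 of the RangeIndex never match the (dt, st) tuples, so all rows are NaN.
misaligned = pd.DataFrame(data, index=idx, columns=col)
all_nan = bool(misaligned.isna().all().all())

# set_index moves the existing columns into the index, so the values are kept.
fixed = data.set_index(["dt", "st"])

print(all_nan)
print(fixed)
```

The same label matching happens with data.reindex(idx), which is effectively what the constructor does here.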
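If every dt always has the same st values, a possible solution is to filter the DataFrame by column names and then convert it to a numpy array, so that the values are placed by position instead of being aligned by label. The index_col and set_index solutions are still the better choice.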
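A minimal sketch of the numpy array approach mentioned above. It assumes that the rows are already ordered by dt and then by st, and that every dt carries the same st values in the same order; otherwise the values end up under the wrong labels. The small frame is again a hypothetical stand-in for the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the CSV from the question.
data = pd.DataFrame({
    "dt": [20100101, 20100101, 20100102, 20100102],
    "st": ["000001.sz", "000002.sz", "000001.sz", "000002.sz"],
    "close": [1, 10, 2, 20],
    "volume": [10000, 50000, 20000, 60000],
})

idx = pd.MultiIndex.from_product(
    [data["dt"].unique(), data["st"].unique()], names=["dt", "st"]
)
col = ["close", "volume"]

# to_numpy() drops the RangeIndex, so the constructor has no labels to
# align on and simply fills the rows in their existing order.
df = pd.DataFrame(data[col].to_numpy(), index=idx, columns=col)
print(df)
```

Because this relies purely on row order, data.set_index(['dt', 'st']) or read_csv with index_col is safer whenever the grid might be incomplete or unsorted.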