Sorting a multi-level index in a pandas DataFrame and removing the blank line written to CSV

Time: 2017-08-21 23:31:37

Tags: pandas sorting dataframe multi-index

Here is some simple code that creates a DataFrame with two-level columns.

import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame(np.random.randn(6,4), columns=dates)
df1 = pd.DataFrame({'E' : ["foo1","foo2","foo3","foo4","foo5","foo6"]})
df2=pd.concat(dict(L0 = df, L1 = df1),axis=1)

Output of df2:

                   L0                                          \
  2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00   
0            0.530496           -1.536075           -0.592824   
1            0.614626            0.146761            1.799287   
2           -0.398504           -0.863021           -0.208724   
3            0.901720            0.717144            1.504012   
4           -0.570248           -0.967722           -0.478540   
5            2.225644            2.452121           -0.131774   

                         L1  
  2013-01-04 00:00:00     E  
0            1.293738  foo1  
1            1.469431  foo2  
2           -2.084461  foo3  
3           -0.199157  foo4  
4           -1.627641  foo5  
5           -1.970185  foo6  

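I have these three questions. Please help:

1) How do I rearrange the columns so that the dates are in descending order?
2) How do I show only the date (not the full timestamp) in the column headers?
3) If you write df2 to CSV, it creates a blank line. I read some Q&As saying this is a bug in multi-level output. Has it been fixed? If not, what is the best way to remove the line?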

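1 Answer:

Answer 0: (score: 1)

Assuming you are able to fix things up while building df2, the first two problems can be solved by sorting the columns of df and then converting the column labels to strings: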

df = df.sort_index(ascending=False, axis=1)
df.columns = df.columns.format()
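
As far as I know, Index.format() is deprecated in newer pandas releases, so the second line may warn or fail there. A minimal sketch of an alternative, assuming the columns are a DatetimeIndex as in the question, is to build the labels with DatetimeIndex.strftime; the "%Y-%m-%d" pattern is my choice, not something the original answer specifies.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.random.randn(6, 4), columns=dates)

# Sort the datetime columns newest-first while they are still real timestamps.
df = df.sort_index(ascending=False, axis=1)

# Replace the timestamps with date-only strings (assumed format: YYYY-MM-DD).
df.columns = df.columns.strftime("%Y-%m-%d")

print(list(df.columns))
# ['2013-01-04', '2013-01-03', '2013-01-02', '2013-01-01']
```

Sorting before formatting keeps the ordering chronological rather than lexical, although with ISO-style date strings the two orders happen to agree.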

With the current version of pandas, 0.21.0 (dev),

df2.to_csv('/tmp/test.csv')

creates a CSV with no blank line. If you try it with the latest stable version, 0.20.3, I think you will get the same result (see below).

For example,

import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame(np.random.randn(6,4), columns=dates)
df1 = pd.DataFrame({'E' : ["foo1","foo2","foo3","foo4","foo5","foo6"]})

df = df.sort_index(ascending=False, axis=1)
df.columns = df.columns.format()

df2 = pd.concat(dict(L0=df, L1=df1),axis=1)
df2.to_csv('/tmp/test.csv')

creates /tmp/test.csv with the contents

,L0,L0,L0,L0,L1
,2013-01-04,2013-01-03,2013-01-02,2013-01-01,E
0,0.02140012949846106,0.26277798576234707,0.3417048534674754,-0.2415864990096712,foo1
1,1.5529608360704856,0.04473119120484416,0.2563552549068564,-0.7234609815350183,foo2
2,0.3197702495146119,-0.4796536804964018,-1.0049744963838612,0.039249748655535384,foo3
3,-1.5129389373140296,-0.2528463527601262,-0.22930219559242235,-0.6661663277403631,foo4
4,0.03756426242171489,0.20880577998533037,1.0229358239647364,0.6539470866256701,foo5
5,-1.8477638391042324,-0.8315712350681457,-0.0743680147471108,0.8503850287138673,foo6
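
To check this on your own pandas version without opening the file, something like the following sketch may help. It writes the CSV to an in-memory buffer and looks for rows that contain nothing but commas. The is_blank helper and the fallback filter are hypothetical additions of mine, meant only for versions that still emit such a row; the strftime call is swapped in for format() on the assumption that the columns are a DatetimeIndex.

```python
import io

import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.random.randn(6, 4), columns=dates)
df1 = pd.DataFrame({"E": ["foo1", "foo2", "foo3", "foo4", "foo5", "foo6"]})

df = df.sort_index(ascending=False, axis=1)
df.columns = df.columns.strftime("%Y-%m-%d")
df2 = pd.concat(dict(L0=df, L1=df1), axis=1)

# Write to an in-memory buffer so the text can be inspected directly.
buf = io.StringIO()
df2.to_csv(buf)
lines = buf.getvalue().splitlines()


def is_blank(line):
    # Hypothetical helper: a row is "blank" if it holds only commas or whitespace.
    return line.strip(", \t") == ""


blank_rows = [i for i, line in enumerate(lines) if is_blank(line)]
print("blank rows at positions:", blank_rows)

# Fallback for versions that do emit such a row: drop it before saving.
cleaned = "\n".join(line for line in lines if not is_blank(line)) + "\n"
```

The cleaned string can then be written out with an ordinary open(path, "w") call. The filter would also drop a genuinely empty data row, so it is only safe when the frame itself has no all-empty rows.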

By the way, you might also want to consider this layout, which seems more compact:

import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=4)
df = pd.DataFrame(np.random.randn(6,4), columns=dates)
df1 = pd.DataFrame({'E' : ["foo1","foo2","foo3","foo4","foo5","foo6"]})

df = df.T
df.columns = df1['E']
print(df)

yields

E               foo1      foo2      foo3      foo4      foo5      foo6
2013-01-01  0.166074  0.398726 -0.410202  0.397486 -0.811873  0.462652
2013-01-02  0.406810 -0.313234  0.062569 -0.140924 -1.087162  1.600549
2013-01-03 -0.573118  1.331461 -0.115200 -1.934654 -1.427441 -0.889541
2013-01-04 -0.919885 -1.197192 -0.476039  1.186531  1.013803  0.400977
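
With this transposed layout the dates sit in the row index, so the columns are single-level and the multi-level header question does not come up. A rough sketch of how the descending order and date-only labels could be applied here, assuming the same df and df1 as above (the strftime pattern is again my assumption):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df = pd.DataFrame(np.random.randn(6, 4), columns=dates)
df1 = pd.DataFrame({"E": ["foo1", "foo2", "foo3", "foo4", "foo5", "foo6"]})

df = df.T
df.columns = df1["E"]

# The dates are now the row index: sort newest-first, then keep only the date part.
df = df.sort_index(ascending=False)
df.index = df.index.strftime("%Y-%m-%d")

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv()
print(csv_text)
```

The result has one header row followed by one row per date.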