Converting normalized columns to multi-level columns

Asked: 2015-11-05 10:23:57

Tags: python pandas

I have a CSV in the following format:

print(rfd.iloc[:5, :5])

                            Sub-division   January 2010 Actual   January  2010 Normal   January 2011 Actual   February  2010 Actual 
0            Andaman and Nicobar Islands                   98.2                   53.7                 222.5                     5.8
1                       Arunachal Pradesh                   0.4                   50.1                  37.6                    10.0
2                     Assam and Meghalaya                   0.2                   16.4                   9.0                     3.4
3  Nagaland,Manipur, Mizoram, and Tripura                   0.9                   13.7                   7.9                    10.9
4     Sub-Himalayan,West Bengal & Sikkim                    1.7                   26.6                   7.1                     6.4

How can I convert this to multi-level columns? The first level should be Year, followed by Month and then type (Actual or Normal).

rfd.columns
Out[89]: 
Index([u'Sub-division ', u'January 2010 Actual ', u'January  2010 Normal ',
       u'January 2011 Actual ', u'February  2010 Actual ',
     ....
       u'December  2010 Normal ', u'   December 2011 Actual '],
      dtype='object')

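I tried something like rfd.columns = rfd.columns.str.split(" "), after which the data frame raises TypeError: unhashable type: 'list'. If this were a single file I could fix the header by hand in the CSV and load it, but this is a recurring process, so I am looking for a solution I can apply while iterating over files.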

Adding the first two rows as a dict:

{'April  2010 Normal': {0: 81.5, 1: 278.80000000000001},
 'April 2010 Actual': {0: 12.699999999999999, 1: 245.80000000000001},
 'April 2011 Actual': {0: 83.700000000000003, 1: 114.7},
 'August  2010 Actual': {0: 550.0, 1: 343.30000000000001},
 'August  2010 Normal': {0: 403.80000000000001, 1: 359.89999999999998},
 'August 2011 Actual': {0: 513.0, 1: 225.80000000000001},
 'December  2010 Normal': {0: 145.5, 1: 38.399999999999999},
 'December 2010 Actual': {0: 254.40000000000001, 1: 6.0},
 'December 2011 Actual': {0: 246.30000000000001, 1: 10.300000000000001},
 'February  2010 Actual': {0: 5.7999999999999998, 1: 10.0},
 'February  2010 Normal': {0: 29.199999999999999, 1: 98.0},
 'February  2011 Actual': {0: 81.900000000000006, 1: 36.799999999999997},
 'January  2010 Normal': {0: 53.700000000000003, 1: 50.100000000000001},
 'January 2010 Actual': {0: 98.200000000000003, 1: 0.40000000000000002},
 'January 2011 Actual': {0: 222.5, 1: 37.600000000000001},
 'July  2010 Normal': {0: 407.69999999999999, 1: 536.10000000000002},
 'July 2010 Actual': {0: 522.10000000000002, 1: 426.0},
 'July 2011 Actual': {0: 575.79999999999995, 1: 553.5},
 'June  2010 Normal': {0: 438.60000000000002, 1: 500.39999999999998},
 'June  2011 Actual': {0: 418.39999999999998, 1: 336.80000000000001},
 'June 2010 Actual': {0: 435.0, 1: 397.30000000000001},
 'March   2010 Normal': {0: 25.0, 1: 179.69999999999999},
 'March  2010 Normal': {0: 20.5, 1: 164.40000000000001},
 'March  2011 Actual': {0: 305.5, 1: 121.5},
 'March 2010 Actual': {0: 0.40000000000000002, 1: 143.59999999999999},
 'May  2010 Actual': {0: 310.69999999999999, 1: 273.80000000000001},
 'May  2010 Normal': {0: 358.5, 1: 291.89999999999998},
 'May 2011 Actual': {0: 305.69999999999999, 1: 157.80000000000001},
 'November  2010 Normal': {0: 253.69999999999999, 1: 45.799999999999997},
 'November 2010 Actual': {0: 281.39999999999998, 1: 59.700000000000003},
 'November 2011 Actual': {0: 126.0, 1: 19.800000000000001},
 'October  2010 Actual': {0: 415.19999999999999, 1: 84.400000000000006},
 'October  2010 Normal': {0: 296.69999999999999, 1: 183.0},
 'October  2011 Actual': {0: 183.80000000000001, 1: 46.799999999999997},
 'September  2010 Normal': {0: 432.39999999999998, 1: 371.60000000000002},
 'September 2010 Actual': {0: 261.30000000000001, 1: 407.39999999999998},
 'September 2011 Actual': {0: 770.89999999999998, 1: 262.0},
 'Sub-division': {0: 'Andaman and Nicobar Islands ', 1: 'Arunachal Pradesh'},
 'october  2010 Normal': {0: 297.80000000000001, 1: 159.09999999999999}}

1 answer:

Answer 0: (score: 1)

I am fairly sure this is not the best way to do it, and it is probably not optimal:

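import pandas as pd

a = pd.read_csv('data.csv', sep=';')
# The headers in the question carry stray leading/trailing spaces
a.columns = a.columns.str.strip()

# Long format: one row per (original column label, Sub-division) pair
b = a.set_index('Sub-division').unstack().reset_index()
c = b['level_0']

# Split labels such as 'January  2010 Normal' into Month, Year and Level
d = c.str.extract(r'(?P<Month>[A-Za-z]*) +(?P<Year>[0-9][\w\d]*) +(?P<Level>[A-Za-z]*)')

e = pd.concat([b[['Sub-division', 0]], d], axis=1)

f = e.set_index(['Sub-division', 'Year', 'Month', 'Level'])

# Move Year, Month and Level from the row index into the columns
f = f.unstack(['Year', 'Month', 'Level'])

# Drop the leftover top level (the value column named 0)
f.columns = f.columns.droplevel(0)

# sortlevel was removed from later pandas versions; sort_index is the replacement
f = f.sort_index(axis=1, level=0)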

It does what you want, though. The method you are probably looking for is Series.str.extract.
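To show what str.extract does on its own, here is a minimal sketch that applies a similar pattern directly to the column labels and builds the column MultiIndex without reshaping the data. The small frame is a stand-in using values from the question's sample rows, and the names wide, pattern and parts are illustrative assumptions rather than part of the original answer.

```python
import pandas as pd

# Small stand-in for the question's frame (values copied from the sample rows).
rfd = pd.DataFrame({
    'Sub-division ': ['Andaman and Nicobar Islands', 'Arunachal Pradesh'],
    'January 2010 Actual ': [98.2, 0.4],
    'January  2010 Normal ': [53.7, 50.1],
    'January 2011 Actual ': [222.5, 37.6],
    'February  2010 Actual ': [5.8, 10.0],
})

# Strip stray spaces from the headers, then move the label column to the index.
rfd.columns = rfd.columns.str.strip()
wide = rfd.set_index('Sub-division')

# Each named group becomes one column of the extracted frame.
pattern = r'(?P<Month>[A-Za-z]+) +(?P<Year>[0-9]+) +(?P<Level>[A-Za-z]+)'
parts = wide.columns.to_series().str.extract(pattern)

# Build the column MultiIndex in the order Year, Month, Level.
wide.columns = pd.MultiIndex.from_frame(parts[['Year', 'Month', 'Level']])
wide = wide.sort_index(axis=1)

print(wide)
```

Each column label ends up as a tuple such as ('2010', 'January', 'Normal'). Tuples are hashable while the lists produced by str.split are not, which is consistent with the TypeError reported in the question. Note that Year stays a string here; cast it with astype(int) first if numeric years are needed.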

Output:

Year                                      2010                   2011
Month                                 February January        January
Level                                   Actual  Actual Normal  Actual
Sub-division                                                         
Andaman and Nicobar Islands                5.8    98.2   53.7   222.5
Arunachal Pradesh                         10.0     0.4   50.1    37.6
Assam and Meghalaya                        3.4     0.2   16.4     9.0
Nagaland,Manipur, Mizoram and Tripura     10.9     0.9   13.7     7.9
Sub-Himalayan,West Bengal & Sikkim         6.4     1.7   26.6     7.1

pandas has dedicated tools for working with time series, so you may be able to represent this data in a better way.
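One possible reading of that remark, sketched under assumptions: combine Month and Year into a monthly period and use it as the row index, so that each sub-division and level becomes a time series. The long-format frame below is a hypothetical stand-in built from values in the question, and the names long_df, Period and ts are illustrative. The '%B' format code expects English month names and depends on the active locale.

```python
import pandas as pd

# Long-format stand-in: one row per sub-division, month, year and level.
long_df = pd.DataFrame({
    'Sub-division': ['Andaman and Nicobar Islands'] * 3,
    'Month': ['January', 'February', 'January'],
    'Year': ['2010', '2010', '2011'],
    'Level': ['Actual', 'Actual', 'Actual'],
    'value': [98.2, 5.8, 222.5],
})

# Parse strings like 'January 2010' into timestamps, then into monthly periods.
dates = pd.to_datetime(long_df['Month'] + ' ' + long_df['Year'], format='%B %Y')
long_df['Period'] = dates.dt.to_period('M')

# One row per month, one column per (Sub-division, Level) pair.
ts = (long_df.set_index(['Period', 'Sub-division', 'Level'])['value']
             .unstack(['Sub-division', 'Level'])
             .sort_index())

print(ts)
```

With the months on the row index in chronological order, operations such as slicing by date range or comparing the same month across years become simpler than with month names stored as column labels.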