Converting a Pandas DataFrame to bin frequencies

Time: 2014-04-25 23:33:27

Tags: python pandas

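With pandas, I know how to bin a single column, but I'm struggling to work out how to bin multiple columns and then find the counts (frequencies) of the bins, since my dataframe has 20 columns. I know I could apply the single-column approach 20 times, but I'm interested in learning a new, better way. Here is the head of the dataframe, with 4 columns shown: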

      Percentile1 Percentile2 Percentile3   Percentile4
395     0.166667    0.266667    0.266667    0.133333
424     0.266667    0.266667    0.133333    0.032258
511     0.032258    0.129032    0.129032    0.387097
540     0.129032    0.129032    0.387097    0.612903
570     0.129032    0.387097    0.612903    0.741935

I created the following array of bin labels:

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']

This is the output I want:

      Percentile1 Percentile2 Percentile3   Percentile4
395     10-20        20-30      20-30           10-20
424     20-30        20-30      10-20           0-10
511     0-10         10-20      10-20           30-40
540     10-20        10-20      30-40           60-70
570     10-20        30-40      60-70           70-80

After this, I would ideally do a frequency/value count to get something like this:

      Percentile1 Percentile2 Percentile3   Percentile4
0-10    frequency #'s        
10-20   
20-30   
30-40   
40-50   
etc...

Any help would be greatly appreciated.

2 Answers:

Answer 0: (score: 2)

I would probably do the following:

print(df)

   Percentile1  Percentile2  Percentile3  Percentile4
0     0.166667     0.266667     0.266667     0.133333
1     0.266667     0.266667     0.133333     0.032258
2     0.032258     0.129032     0.129032     0.387097
3     0.129032     0.129032     0.387097     0.612903
4     0.129032     0.387097     0.612903     0.741935

Now use apply and cut to create a new dataframe that replaces each percentile with the decile bin it falls in (apply iterates over each column):

bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100
new = df.apply(lambda x: pd.Series(pd.cut(x*100, bins)))  # scale 0-1 values to 0-100 before cutting
print(new)

  Percentile1 Percentile2 Percentile3 Percentile4
0    (10, 20]    (20, 30]    (20, 30]    (10, 20]
1    (20, 30]    (20, 30]    (10, 20]     (0, 10]
2     (0, 10]    (10, 20]    (10, 20]    (30, 40]
3    (10, 20]    (10, 20]    (30, 40]    (60, 70]
4    (10, 20]    (30, 40]    (60, 70]    (70, 80]
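
If the '0-10' style labels from the question are wanted rather than interval notation, pd.cut accepts a labels argument. The following is a minimal sketch, assuming the sample values and index from the question; the names labels and labeled are introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question (index values as shown there).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

bins = list(range(0, 110, 10))                           # edges 0, 10, ..., 100
labels = ["%d-%d" % (lo, lo + 10) for lo in bins[:-1]]   # '0-10', ..., '90-100'

# labels= swaps the default interval notation for the given strings;
# there must be exactly one label per bin (len(bins) - 1).
labeled = df.apply(lambda col: pd.cut(col * 100, bins, labels=labels))
print(labeled)
```

The bins are still right-closed here, so a value of exactly 20 would be labeled '10-20'.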

Use apply again to get the relative frequencies:

print(new.apply(lambda x: x.value_counts()/x.count()))

         Percentile1  Percentile2  Percentile3  Percentile4
(0, 10]           0.2          NaN          NaN          0.2
(10, 20]          0.6          0.4          0.4          0.2
(20, 30]          0.2          0.4          0.2          NaN
(30, 40]          NaN          0.2          0.2          0.2
(60, 70]          NaN          NaN          0.2          0.2
(70, 80]          NaN          NaN          NaN          0.2
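
As a side note, value_counts has a normalize parameter that performs the same division. A small sketch on a single column, assuming the Percentile1 values from the question; the names col, binned, manual and builtin are illustrative.

```python
import pandas as pd

# Percentile1 values from the question.
col = pd.Series([0.166667, 0.266667, 0.032258, 0.129032, 0.129032])
binned = pd.cut(col * 100, list(range(0, 110, 10)))

manual = binned.value_counts() / binned.count()   # the form used in the answer
builtin = binned.value_counts(normalize=True)     # same proportions, computed by pandas
print(builtin)
```

Both forms divide by the number of non-missing values, so they should agree as long as missing values are dropped from the counts, which is the default.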

Or the raw value counts:

print(new.apply(lambda x: x.value_counts()))

          Percentile1  Percentile2  Percentile3  Percentile4
(0, 10]             1          NaN          NaN            1
(10, 20]            3            2            2            1
(20, 30]            1            2            1          NaN
(30, 40]          NaN            1            1            1
(60, 70]          NaN          NaN            1            1
(70, 80]          NaN          NaN          NaN            1

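Alternatively, instead of creating the intermediate dataframe (which I called new), you can compute the value counts directly in a single command: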

print(df.apply(lambda x: pd.value_counts(pd.cut(x*100, bins))))

          Percentile1  Percentile2  Percentile3  Percentile4 
(0, 10]             1          NaN          NaN            1
(10, 20]            3            2            2            1
(20, 30]            1            2            1          NaN
(30, 40]          NaN            1            1            1
(60, 70]          NaN          NaN            1            1
(70, 80]          NaN          NaN          NaN            1
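
The one-step version can be written as a self-contained script along the following lines. This is a sketch, not the answer's exact code: it uses the Series.value_counts method instead of the top-level function, passes sort=False so the rows stay in bin order, and replaces NaN with 0 so the counts are integers. The name counts is introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    }
)
bins = list(range(0, 110, 10))

counts = (
    df.apply(lambda col: pd.cut(col * 100, bins).value_counts(sort=False))
      .fillna(0)      # a bin that a column never hits counts as 0, not NaN
      .astype(int)
)
print(counts)
```

Depending on the pandas version, empty bins may or may not appear as rows in the result; the non-empty bins should match the table above.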

Answer 1: (score: 0)

If you want labels such as '0-10' instead of the (20, 30] style intervals that pd.cut provides, you can do it another way.

In [52]:

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
df2 = (df*10).astype(int)                # integer part of x*10 is the index of the bin label
df2 = df2.applymap(lambda x: output[x])  # look up the label for every cell
print(df2)
    Percentile1 Percentile2 Percentile3 Percentile4
395       10-20       20-30       20-30       10-20
424       20-30       20-30       10-20        0-10
511        0-10       10-20       10-20       30-40
540       10-20       10-20       30-40       60-70
570       10-20       30-40       60-70       70-80

[5 rows x 4 columns]

In [53]:
print(df2.apply(lambda x: x.value_counts()))  # or divide by x.count() for relative frequencies
level_1  Percentile1  Percentile2  Percentile3  Percentile4
class                                                      
0-10               1          NaN          NaN            1
10-20              3            2            2            1
20-30              1            2            1          NaN
30-40            NaN            1            1            1
60-70            NaN          NaN            1            1
70-80            NaN          NaN          NaN            1

[6 rows x 4 columns]
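
One caveat with the integer-index approach: a value of exactly 1.0 gives index 10, which is past the end of the 10-element output list. A minimal sketch of one way to guard against that, assuming it is acceptable to put 1.0 in the last bin; the series s holds hypothetical values chosen to include that edge case.

```python
import pandas as pd

output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']

# Hypothetical values; the last one is the 1.0 edge case.
s = pd.Series([0.032258, 0.129032, 0.741935, 1.0])

# Integer part of x*10 picks the bin; clip keeps 1.0 at index 9 instead of 10.
idx = (s * 10).astype(int).clip(upper=len(output) - 1)
labels = idx.map(lambda i: output[i])
print(labels.tolist())  # ['0-10', '10-20', '70-80', '90-100']
```

Note also that this approach treats bins as left-closed, so 0.2 maps to '20-30', whereas pd.cut with its default right-closed intervals would place 20 in (10, 20].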