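Using pandas, I know how to bin a single column, but I am struggling to work out how to bin multiple columns and then get the counts (frequencies) of the bins, since my dataframe has 20 columns. I know I could repeat the single-column approach 20 times, but I would like to learn a better way. Here is the head of the dataframe, with 4 of the columns shown: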
Percentile1 Percentile2 Percentile3 Percentile4
395 0.166667 0.266667 0.266667 0.133333
424 0.266667 0.266667 0.133333 0.032258
511 0.032258 0.129032 0.129032 0.387097
540 0.129032 0.129032 0.387097 0.612903
570 0.129032 0.387097 0.612903 0.741935
I created the following array of bin labels:
output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
This is the output I want:
Percentile1 Percentile2 Percentile3 Percentile4
395 10-20 20-30 20-30 10-20
424 20-30 20-30 10-20 0-10
511 0-10 10-20 10-20 30-40
540 10-20 10-20 30-40 60-70
570 10-20 30-40 60-70 70-80
After that, I would ideally do a frequency/value count to get something like this:
Percentile1 Percentile2 Percentile3 Percentile4
0-10 frequency #'s
10-20
20-30
30-40
40-50
etc...
Any help would be greatly appreciated.
Answer 0 (score: 2)
I would probably do the following:
print(df)
Percentile1 Percentile2 Percentile3 Percentile4
0 0.166667 0.266667 0.266667 0.133333
1 0.266667 0.266667 0.133333 0.032258
2 0.032258 0.129032 0.129032 0.387097
3 0.129032 0.129032 0.387097 0.612903
4 0.129032 0.387097 0.612903 0.741935
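Now use apply and cut to create a new dataframe in which each percentile is replaced by the decile bin it falls into (apply iterates over each column):
bins = range(0, 110, 10)
new = df.apply(lambda x: pd.Series(pd.cut(x*100, bins)))
print(new)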
Percentile1 Percentile2 Percentile3 Percentile4
0 (10, 20] (20, 30] (20, 30] (10, 20]
1 (20, 30] (20, 30] (10, 20] (0, 10]
2 (0, 10] (10, 20] (10, 20] (30, 40]
3 (10, 20] (10, 20] (30, 40] (60, 70]
4 (10, 20] (30, 40] (60, 70] (70, 80]
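For anyone who wants to reproduce the binning step, here is a minimal self-contained sketch. It assumes Python 3 and a recent pandas, and it rebuilds the sample data by hand from the question; the name binned is only illustrative.

```python
import pandas as pd

# Sample data copied from the question (index labels as shown there).
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

bins = list(range(0, 110, 10))  # bin edges 0, 10, ..., 100

# apply hands each column (a Series) to the lambda; pd.cut returns a
# Series of right-closed intervals such as (10, 20] with the same index.
binned = df.apply(lambda col: pd.cut(col * 100, bins))
print(binned)
```

Note that with the default right-closed bins a value of exactly 0 falls outside the first interval and becomes NaN; passing include_lowest=True to pd.cut is one way to keep it.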
Use apply again to get the relative frequencies:
print(new.apply(lambda x: x.value_counts()/x.count()))
Percentile1 Percentile2 Percentile3 Percentile4
(0, 10] 0.2 NaN NaN 0.2
(10, 20] 0.6 0.4 0.4 0.2
(20, 30] 0.2 0.4 0.2 NaN
(30, 40] NaN 0.2 0.2 0.2
(60, 70] NaN NaN 0.2 0.2
(70, 80] NaN NaN NaN 0.2
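As a possible alternative to dividing by x.count(), value_counts accepts normalize=True, which returns proportions directly. The sketch below assumes Python 3 and a recent pandas and rebuilds the question's sample data; freq is an illustrative name. Depending on the pandas version, bins with no observations may appear as NaN or as 0.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

bins = list(range(0, 110, 10))

# normalize=True divides each count by the number of non-missing values
# in the column, so every column of proportions sums to 1.
freq = df.apply(lambda col: pd.cut(col * 100, bins).value_counts(normalize=True))
print(freq)
```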
Or the raw value counts:
print(new.apply(lambda x: x.value_counts()))
Percentile1 Percentile2 Percentile3 Percentile4
(0, 10] 1 NaN NaN 1
(10, 20] 3 2 2 1
(20, 30] 1 2 1 NaN
(30, 40] NaN 1 1 1
(60, 70] NaN NaN 1 1
(70, 80] NaN NaN NaN 1
Another approach is to skip the intermediate dataframe (which I called new) and compute the counts directly in a single command:
print(df.apply(lambda x: pd.cut(x*100, bins).value_counts()))
Percentile1 Percentile2 Percentile3 Percentile4
(0, 10] 1 NaN NaN 1
(10, 20] 3 2 2 1
(20, 30] 1 2 1 NaN
(30, 40] NaN 1 1 1
(60, 70] NaN NaN 1 1
(70, 80] NaN NaN NaN 1
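If integer counts with 0 instead of NaN are preferred, the one-step version can be followed by fillna(0) and astype(int). This is only a sketch, assuming Python 3 and a recent pandas, with the question's sample data rebuilt by hand; counts is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

bins = list(range(0, 110, 10))

# Count the bin membership per column, then replace missing combinations
# (a bin that never occurs in a column) with 0 and cast to int.
counts = (
    df.apply(lambda col: pd.cut(col * 100, bins).value_counts())
    .fillna(0)
    .astype(int)
)
print(counts)
```

Because every row of the sample data lands in some bin, each column of counts sums to the number of rows (5 here), which is a quick sanity check.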
Answer 1 (score: 0)
If you want labels such as '0-10' instead of the (20, 30] style intervals that pd.cut produces, you can take a different approach.
In [52]:
output = ['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100']
df2=(df*10).astype(int)
df2=df2.applymap(lambda x: output[x])
print(df2)
Percentile1 Percentile2 Percentile3 Percentile4
395 10-20 20-30 20-30 10-20
424 20-30 20-30 10-20 0-10
511 0-10 10-20 10-20 30-40
540 10-20 10-20 30-40 60-70
570 10-20 30-40 60-70 70-80
[5 rows x 4 columns]
In [53]:
print(df2.apply(lambda x: x.value_counts()))  # or divide by x.count() for relative frequencies
level_1 Percentile1 Percentile2 Percentile3 Percentile4
class
0-10 1 NaN NaN 1
10-20 3 2 2 1
20-30 1 2 1 NaN
30-40 NaN 1 1 1
60-70 NaN NaN 1 1
70-80 NaN NaN NaN 1
[6 rows x 4 columns]
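The two answers can also be combined: pd.cut takes a labels argument, so the string labels can be attached during binning. The following is a sketch rather than a definitive solution; it assumes Python 3 and a recent pandas, rebuilds the question's sample data, and uses right=False so that the bins are left-closed like the integer truncation above. The names labelled and counts are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "Percentile1": [0.166667, 0.266667, 0.032258, 0.129032, 0.129032],
        "Percentile2": [0.266667, 0.266667, 0.129032, 0.129032, 0.387097],
        "Percentile3": [0.266667, 0.133333, 0.129032, 0.387097, 0.612903],
        "Percentile4": [0.133333, 0.032258, 0.387097, 0.612903, 0.741935],
    },
    index=[395, 424, 511, 540, 570],
)

output = ['0-10', '10-20', '20-30', '30-40', '40-50',
          '50-60', '60-70', '70-80', '80-90', '90-100']
bins = list(range(0, 110, 10))  # 11 edges for the 10 labels

# right=False makes the bins [0, 10), [10, 20), ...; a value of exactly
# 100 would then fall outside the last bin and become NaN.
labelled = df.apply(lambda col: pd.cut(col * 100, bins, labels=output, right=False))
print(labelled)

# Count labels per column, list every label in order, and show 0 for
# labels that never occur.
counts = (
    labelled.astype(object)
    .apply(lambda col: col.value_counts())
    .reindex(output)
    .fillna(0)
    .astype(int)
)
print(counts)
```

Reindexing by output keeps the rows in bin order and includes the empty bins, which matches the layout sketched in the question.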