This week I decided to dive into pandas. I have a pandas DataFrame of historical IRC logs that looks like this:
timestamp action nick message
2005-11-04 01:44:33 False hack-cclub lex, hey!
2005-11-04 01:44:43 False hack-cclub lol, yea thats broke
2005-11-04 01:44:56 False lex Slashdot - Updated 2005-11-04 00:23:00 | Micro...
2005-11-04 01:44:56 False hack-cclub lex slashdot
2005-11-04 01:45:12 False lex port 666 is doom - doom Id Software (or mdqs o..
2005-11-04 01:45:12 False hack-cclub lex, port 666
2005-11-04 01:45:21 False hitokiri lex, port 23485
2005-11-04 01:45:45 False hitokiri lex, port 1024
2005-11-04 01:45:46 True hack-cclub slaps lex around with a wet fish
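There are about 55,000 rows, and I am trying to build some basic visualizations, such as the top 25 nicks over time. I know I can get the top 25 nicks like this: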
df['nick'].value_counts()[:25]
What I want is a running count like this:
hack-cclub lex hitokiri
1 0 0
2 0 0
2 1 0
3 1 0
3 2 0
4 2 0
4 2 1
4 2 2
5 2 2
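That way I could plot an area chart of messages for the top 25 nicks. I know I could do this by iterating over the whole DataFrame and keeping counts myself, but since the whole point is to learn pandas, I am hoping there is a more idiomatic way. It would also be nice to have the same data as ranks instead of running counts, like this: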
hack-cclub lex hitokiri
1 2 2
1 2 2
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 2
1 2 2
Answer 0 (score: 2)
print(df[['timestamp', 'nick']])
timestamp nick
0 2005-11-04 01:44:33 hack-cclub
1 2005-11-04 01:44:43 hack-cclub
2 2005-11-04 01:44:56 lex
3 2005-11-04 01:44:56 hack-cclub
4 2005-11-04 01:45:12 lex
5 2005-11-04 01:45:12 hack-cclub
6 2005-11-04 01:45:21 hitokiri
7 2005-11-04 01:45:45 hitokiri
8 2005-11-04 01:45:46 hack-cclub
# Count messages per nick for each distinct timestamp (assumes pandas is imported as pd)
df = pd.crosstab(df.timestamp, df.nick)
print(df)
nick hack-cclub hitokiri lex
timestamp
2005-11-04 01:44:33 1 0 0
2005-11-04 01:44:43 1 0 0
2005-11-04 01:44:56 1 0 1
2005-11-04 01:45:12 1 0 1
2005-11-04 01:45:21 0 1 0
2005-11-04 01:45:45 0 1 0
2005-11-04 01:45:46 1 0 0
# Running total of each nick's message count down the rows
df = df.cumsum()
print(df)
nick hack-cclub hitokiri lex
timestamp
2005-11-04 01:44:33 1 0 0
2005-11-04 01:44:43 2 0 0
2005-11-04 01:44:56 3 0 1
2005-11-04 01:45:12 4 0 2
2005-11-04 01:45:21 4 1 2
2005-11-04 01:45:45 4 2 2
2005-11-04 01:45:46 5 2 2
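
Note that the crosstab above has one row per distinct timestamp, so the nine sample messages collapse into seven rows. If one row per message is wanted, as in the tables in the question, a possible variation is sketched below. It is not part of the original answer: it swaps `pd.crosstab` for `pd.get_dummies`, and the names `top_n`, `top_nicks`, `indicators`, `running` and `ranks` are made up for illustration. The sample frame is rebuilt from the rows shown in the question.

```python
import pandas as pd

# Sample rows copied from the question (timestamp and nick columns only).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2005-11-04 01:44:33", "2005-11-04 01:44:43", "2005-11-04 01:44:56",
        "2005-11-04 01:44:56", "2005-11-04 01:45:12", "2005-11-04 01:45:12",
        "2005-11-04 01:45:21", "2005-11-04 01:45:45", "2005-11-04 01:45:46",
    ]),
    "nick": ["hack-cclub", "hack-cclub", "lex", "hack-cclub", "lex",
             "hack-cclub", "hitokiri", "hitokiri", "hack-cclub"],
})

# Hypothetical cutoff: the question asks for the 25 most active nicks.
top_n = 25
top_nicks = df["nick"].value_counts().index[:top_n]

# One indicator column per nick and one row per message,
# then a running total down the rows.
indicators = pd.get_dummies(df["nick"]).astype(int)
running = indicators[top_nicks].cumsum()

# Rank within each row: 1 is the nick with the most messages so far,
# and tied nicks share the lowest rank of their group.
ranks = running.rank(axis=1, method="min", ascending=False).astype(int)

print(running)
print(ranks)
```

On the sample rows this should reproduce both tables from the question, though the order of the tied columns may differ. For the area chart, something like `running.plot.area()` should work, assuming matplotlib is installed.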