I have a pandas DataFrame with roughly 150,000,000 rows in the following format:
df.head()
Out[1]:
   ID TERM   X
0   1    A   0
1   1    A   4
2   1    A   6
3   1    B   0
4   1    B  10
5   2    A   1
6   2    B   1
7   2    F   1
I want to group by ID and TERM and count the number of rows in each group. Currently I do the following:
df.groupby(['ID','TERM']).count()
Out[2]:
   ID TERM  X
0   1    A  3
1   1    B  2
2   2    A  1
3   2    B  1
4   2    F  1
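But this takes roughly two minutes. The same operation with R's data.table takes about 23 seconds. Is there a more efficient way to do this in Python?
For comparison, R data.table: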
system.time({ df[,.(.N), .(ID, TERM)] })
#user: 30.32 system: 2.45 elapsed: 22.88
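One detail about the pandas call in the question: `count()` reports the number of non-null values in each remaining column, while `size()` reports the number of rows per group, which is what `.N` does in data.table. A minimal sketch of the difference, using a small made-up frame with one missing value (the data and the column name `N` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: the second row has a missing X on purpose.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2],
    'TERM': ['A', 'A', 'B', 'A'],
    'X':    [0.0, np.nan, 6.0, 1.0],
})

# count(): non-null X values per (ID, TERM) group
by_count = df.groupby(['ID', 'TERM']).count()

# size(): rows per (ID, TERM) group, regardless of missing values
by_size = df.groupby(['ID', 'TERM']).size().reset_index(name='N')

print(by_count['X'].tolist())
print(by_size['N'].tolist())
```

The two only agree when the counted column has no missing values, as in the question's sample.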
Answer 0 (score: 2)
Here is a NumPy-based solution -
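import numpy as np
import pandas as pd

def groupby_count(df):
    a = df.values
    # np.lexsort uses the LAST key as the primary one: sort by ID, then TERM
    sidx = np.lexsort((a[:, 1], a[:, 0]))
    b = a[sidx, :2]
    # True at the first row of each group, plus a sentinel one past the end
    m = np.concatenate(([True], (b[1:] != b[:-1]).any(1), [True]))
    # group sizes are the gaps between consecutive group starts
    out_ar = np.column_stack((b[m[:-1]], np.diff(np.flatnonzero(m))))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])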
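The core of the function is the boundary mask followed by `np.diff`. A small walk-through of just that step may help, assuming the keys are already sorted by (ID, TERM); the arrays below are copied from the sample frame and the variable names are illustrative only:

```python
import numpy as np

# Keys of the sample frame, already sorted by (ID, TERM).
ids = np.array([1, 1, 1, 1, 1, 2, 2, 2])
terms = np.array(['A', 'A', 'A', 'B', 'B', 'A', 'B', 'F'])

# A row starts a new group when either key differs from the previous row.
changed = (ids[1:] != ids[:-1]) | (terms[1:] != terms[:-1])

# The first row always starts a group; the trailing True is a sentinel
# that marks the position one past the last row.
m = np.concatenate(([True], changed, [True]))

starts = np.flatnonzero(m)  # group start offsets, then the total row count
counts = np.diff(starts)    # gap between consecutive starts = group size

print(starts.tolist())  # [0, 3, 5, 6, 7, 8]
print(counts.tolist())  # [3, 2, 1, 1, 1]
```

The sentinel is what lets a single `np.diff` produce the size of the last group as well.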
Sample run -
In [332]: df
Out[332]:
   ID TERM   X
0   1    A   0
1   1    A   4
2   1    A   6
3   1    B   0
4   1    B  10
5   2    A   1
6   2    B   1
7   2    F   1
In [333]: groupby_count(df)
Out[333]:
   ID TERM  X
0   1    A  3
1   1    B  2
2   2    A  1
3   2    B  1
4   2    F  1
Let's shuffle the rows randomly and verify that the solution still works -
In [339]: df1 = df.iloc[np.random.permutation(len(df))]
In [340]: df1
Out[340]:
   ID TERM   X
7   2    F   1
6   2    B   1
0   1    A   0
3   1    B   0
5   2    A   1
2   1    A   6
1   1    A   4
4   1    B  10
In [341]: groupby_count(df1)
Out[341]:
   ID TERM  X
0   1    A  3
1   1    B  2
2   2    A  1
3   2    B  1
4   2    F  1
The shuffled frame yields the same counts as the original one, so the result does not depend on the row order.
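A larger check along the same lines can be sketched by comparing the sort-and-diff idea against pandas' own `groupby(...).size()` on random data. Everything below is an assumption made for illustration: `count_groups` is a hypothetical helper (not the function from the answer), it converts TERM to fixed-width strings before sorting, it assumes a non-empty frame, and the random frame is synthetic.

```python
import numpy as np
import pandas as pd


def count_groups(df):
    # Hypothetical helper mirroring the sort-and-diff approach.
    ids = df['ID'].to_numpy()
    terms = df['TERM'].to_numpy().astype(str)  # fixed-width strings for sorting
    order = np.lexsort((terms, ids))           # last key (ID) is the primary key
    ids, terms = ids[order], terms[order]
    # True at the first row of every group, plus a sentinel at the end.
    changed = (ids[1:] != ids[:-1]) | (terms[1:] != terms[:-1])
    m = np.concatenate(([True], changed, [True]))
    counts = np.diff(np.flatnonzero(m))
    first = m[:-1]
    return pd.DataFrame({'ID': ids[first], 'TERM': terms[first], 'X': counts})


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'ID': rng.integers(1, 50, size=n),
    'TERM': rng.choice(list('ABCDEF'), size=n),
    'X': rng.integers(0, 100, size=n),
})

expected = df.groupby(['ID', 'TERM']).size().reset_index(name='X')
result = count_groups(df)
shuffled_result = count_groups(df.sample(frac=1, random_state=0))

same_as_pandas = (
    np.array_equal(result['ID'].to_numpy(), expected['ID'].to_numpy())
    and np.array_equal(result['TERM'].to_numpy().astype(str),
                       expected['TERM'].to_numpy().astype(str))
    and np.array_equal(result['X'].to_numpy(), expected['X'].to_numpy())
)
same_after_shuffle = result.equals(shuffled_result)
print(same_as_pandas, same_after_shuffle)
```

Agreement on random data does not say anything about speed; timing both approaches on the real 150,000,000-row frame is still needed.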