Suppose I have the following data:
import pandas as pd
import numpy as np
import random
from string import ascii_uppercase
random.seed(100)
n = 1000000
# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
df = pd.DataFrame(data)
I'd like to quickly compute the global counts of each value across the set of all values in the dataframe.
This works:
from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])
But it's slow:
%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop
I figured this could be sped up with some pandas horsepower:
def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack each of the Series objects in `group_bys`... This is faster than reducing a bunch of dictionaries by keys
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which indicates the factor level for each column in the dataframe
    # Then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts
It's definitely faster (about 75% of the runtime of the previous approach), but there must be something faster...
%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop
The results of the two approaches above are exactly the same (with a slight modification of what quick_global_count returns):
dict(c) == quick_global_count(df).to_dict()[0]
True
What's a faster way of counting global values in a dataframe?
Answer 0 (score: 6)
Approach #1
The trick with NumPy is to convert to numbers (that's where NumPy shines) and simply let bincount do the counting -
a = df.fillna('[').values.astype(str).view(np.uint8)
count = np.bincount(a.ravel())[65:-1]
This works for single characters. np.bincount(a.ravel()) holds the counts of all the characters.
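As a small illustration of the trick (a minimal sketch on assumed toy data, not part of the original answer): uppercase ASCII letters occupy byte codes 65-90, so viewing single-byte characters as uint8 and slicing bincount's output at 65 lines the counts up with 'A'..'Z'.

```python
import numpy as np

# Toy array of single characters, stored as one byte each
a = np.array(['A', 'B', 'A', 'C'], dtype='S1').view(np.uint8)

# bincount indexes by byte code; codes 65..90 are 'A'..'Z'
counts = np.bincount(a, minlength=91)[65:91]
print(counts[:3])  # counts for 'A', 'B', 'C' -> [2 1 1]
```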
Approach #1S (super-charged)
The previous approach had a bottleneck at the string conversion: astype(str). Also, fillna() was another show-stopper. More trickery was needed to super-charge it by getting around those bottlenecks. Now, astype('S1') could be used upfront to force everything to single characters. Thus, single characters stay put, while NaNs get reduced to just a single character 'n'. This lets us skip fillna, as the count for 'n' can simply be skipped later with indexing.
Hence, the implementation would be -
def app1S(df):
    ar = df.values.astype('S1')
    a = ar.view(np.uint8)
    count = np.bincount(a.ravel())[65:65+26]
    return count
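The claim that NaN collapses to the single byte 'n' under astype('S1') can be checked directly (a quick sanity check, not part of the original answer):

```python
import numpy as np

# str(np.nan) is 'nan'; truncating to a single byte leaves b'n'
nan_byte = np.array([np.nan]).astype('S1')
print(nan_byte[0])  # b'n'

# Viewed as uint8 this is ord('n') == 110, outside the 65..90 range,
# so the [65:65+26] slice in app1S drops it automatically.
print(nan_byte.view(np.uint8)[0])  # 110
```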
Timings on pandas-0.20.3 and numpy-1.13.3 -
In [3]: # Setup input
...: random.seed(100)
...: n = 1000000
...: data = {letter: [random.choice(list(ascii_uppercase) +
...: [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
...: df = pd.DataFrame(data)
...:
# @Wen's soln
In [4]: %timeit df.melt().value.value_counts()
1 loop, best of 3: 2.5 s per loop
# @andrew_reece's soln
In [5]: %timeit df.apply(pd.value_counts).sum(axis=1)
1 loop, best of 3: 2.14 s per loop
# Super-charged one
In [6]: %timeit app1S(df)
1 loop, best of 3: 501 ms per loop
Generic case
We can also use np.unique to cover the generic case (data with more than single characters) -
unq, count = np.unique(df.fillna(-999), return_counts=1)
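For instance, a minimal sketch on a toy frame with multi-character strings (assumed data, not from the answer; a string sentinel is used here so the object array sorts cleanly under np.unique):

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({'x': ['AA', 'BB', np.nan],
                         'y': ['AA', 'CC', 'BB']})

# Fill NaNs with a string sentinel, then count every distinct value globally
unq, count = np.unique(df_small.fillna('NA').values, return_counts=True)
print(dict(zip(unq, count)))  # {'AA': 2, 'BB': 2, 'CC': 1, 'NA': 1}
```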
Answer 1 (score: 5)
df.apply(pd.value_counts).sum(axis=1)
Benchmarks:
# example data
N = 10000000
rownum = int(N/1000.)
colnum = int(N/10000.)
str_vals = ['A','B','C','D']
str_data = np.random.choice(str_vals, size=N).reshape(rownum, colnum)
str_df = pd.DataFrame(str_data)
num_vals = [1,2,3,4]
num_data = np.random.choice(num_vals, size=N).reshape(rownum, colnum)
num_df = pd.DataFrame(num_data)
num_df.shape
# (10000, 1000)
%%timeit
num_df.apply(pd.value_counts).sum(axis=1)
# 1 loop, best of 3: 883 ms per loop
%%timeit
str_df.apply(pd.value_counts).sum(axis=1)
# 1 loop, best of 3: 2.76 s per loop
Answer 2 (score: 4)
melt then value_counts (PS: still no match for the numpy solution)
df.melt().value.value_counts()
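To see why this one-liner works (an illustrative sketch on assumed toy data): melt stacks every column into a single value column, so one value_counts call covers the whole frame.

```python
import pandas as pd

df_small = pd.DataFrame({'x': ['A', 'B'], 'y': ['A', 'C']})

melted = df_small.melt()          # columns: 'variable', 'value'
print(melted['value'].tolist())   # ['A', 'B', 'A', 'C']
print(melted['value'].value_counts().to_dict())  # {'A': 2, 'B': 1, 'C': 1}
```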
Timing
%timeit df.melt().value.value_counts()
100 loops, best of 3: 1.43 ms per loop
%timeit c = Counter([v for c in df for v in df[c].fillna(-999)])
100 loops, best of 3: 5.23 ms per loop
%timeit df.apply(pd.value_counts).sum()
100 loops, best of 3: 18.5 ms per loop