Quickly counting the occurrences of all values in a pandas DataFrame

Time: 2017-10-15 20:05:52

Tags: python pandas numpy

Suppose I have the following data:

import pandas as pd
import numpy as np
import random
from string import ascii_uppercase

random.seed(100)

n = 1000000

# Create a bunch of factor data... throw some NaNs in there for good measure
data = {letter: [random.choice(list(ascii_uppercase) + [np.nan]) for _ in range(n)] for letter in ascii_uppercase}

df = pd.DataFrame(data)

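I want to quickly compute, for every distinct value in the DataFrame, how many times it occurs across all columns (a global count).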

This works:

from collections import Counter
c = Counter([v for c in df for v in df[c].fillna(-999)])

But it is slow:

%timeit Counter([v for c in df for v in df[c].fillna(-999)])
1 loop, best of 3: 4.12 s per loop

I thought a function like this could speed things up by using some pandas horsepower:

def quick_global_count(df, na_value=-999):
    df = df.fillna(na_value)
    # Get counts of each element for each column in the passed dataframe
    group_bys = {c: df.groupby(c).size() for c in df}
    # Stack each of the Series objects in `group_bys`... This is faster than reducing a bunch of dictionaries by keys
    stacked = pd.concat([v for k, v in group_bys.items()])
    # Call `reset_index()` to access the index column, which indicates the factor level for each column in dataframe
    # Then groupby and sum on that index to get global counts
    global_counts = stacked.reset_index().groupby('index').sum()
    return global_counts

It is definitely faster (roughly 75% of the previous method's run time), but there must be something faster still...

%timeit quick_global_count(df)
10 loops, best of 3: 3.01 s per loop

Both methods give exactly the same result (after slightly reshaping what quick_global_count returns):

dict(c) == quick_global_count(df).to_dict()[0]
True
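
For illustration, the stack-then-sum idea behind quick_global_count can be traced on a tiny frame. This is only a minimal sketch, not the function above: df_small and the "<NA>" sentinel are made up for the example (a string sentinel is assumed instead of -999 so that every key has the same type), and value_counts stands in for groupby(c).size().

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the large df above
df_small = pd.DataFrame({"x": ["A", "B", np.nan], "y": ["B", "B", "C"]})
filled = df_small.fillna("<NA>")  # string sentinel chosen for this sketch

# One Series of counts per column; the index holds the distinct values
per_column = [filled[c].value_counts() for c in filled]

# Stack the per-column counts; a value seen in several columns appears several times
stacked = pd.concat(per_column)

# Sum the repeated index entries to get the global count per value
global_counts = stacked.groupby(level=0).sum()
print(global_counts.to_dict())
```

Grouping on the index level directly avoids the reset_index() step, at the cost of relying on the values themselves being usable as index labels.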

What is a faster way to compute global value counts in a DataFrame?

3 answers:

Answer 0 (score: 6)

Approach #1

The trick with NumPy is to convert the data to numbers (that is where NumPy shines) and let bincount do the counting:

a = df.fillna('[').values.astype(str).view(np.uint8)
count = np.bincount(a.ravel())[65:-1]

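This works for single characters. np.bincount(a.ravel()) holds the counts for every character code; the slice keeps only the codes for 'A' to 'Z' (65 onwards) and drops the last bin, which belongs to the '[' placeholder used for NaN.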
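
To see why the slice starts at 65, here is a small sketch of the same view-and-bincount trick on a hand-made array. The names chars, bins and letter_counts are illustrative only, and '[' is assumed as the NaN placeholder as above. On Python 3 the string items are 4-byte unicode, so the uint8 view also contains zero bytes; they land in bin 0 and are sliced away.

```python
import numpy as np

# '[' (code 91, right after 'Z') stands in for NaN, as in the snippet above
chars = np.array(["A", "C", "A", "Z", "["])

# Reinterpret the raw character data as bytes; each '<U1' item contributes
# one byte with the character code plus three zero bytes
codes = chars.view(np.uint8)

# bins[k] is the number of bytes equal to k
bins = np.bincount(codes, minlength=92)

# Codes 65..90 are 'A'..'Z'; bin 91 is the NaN placeholder
letter_counts = bins[65:91]
print(letter_counts[0], letter_counts[2], letter_counts[25], bins[91])
```

The explicit slice [65:91] is used here instead of [65:-1], because the latter relies on '[' being the largest code present in the data.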

Approach #1S (super-charged)

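The previous approach has a bottleneck at the string conversion, astype(str), and fillna() is another showstopper. Super-charging it takes a bit more trickery to bypass both. We can use astype('S1') up front to force everything down to a single character: single characters stay as they are, while NaN is reduced to the single character 'n'. That lets us skip fillna altogether, because the count for 'n' is easily left out later by indexing.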

So the implementation would be:

def app1S(df):
    ar = df.values.astype('S1')
    a = ar.view(np.uint8)
    count = np.bincount(a.ravel())[65:65+26]
    return count
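
As a hedged illustration of the 'S1' step, the sketch below runs the same idea on a small object array that stands in for df.values. The array contents and the names as_bytes, letter_counts and nan_count are assumptions made up for this example; the cast of NaN to b'n' is the behaviour described above and may be worth re-checking on your NumPy version.

```python
import numpy as np
from string import ascii_uppercase

# Stand-in for df.values: an object array of single letters and NaN
values = np.array([["A", np.nan], ["B", "A"], [np.nan, "Z"]], dtype=object)

# Force every item to one byte: 'A' -> b'A', NaN -> b'n' (first char of 'nan')
as_bytes = values.astype("S1")

# One byte per item, so the uint8 view holds exactly one code per cell
codes = as_bytes.view(np.uint8)
bins = np.bincount(codes.ravel(), minlength=256)

# Codes 65..90 are 'A'..'Z'; lowercase 'n' sits outside that range
letter_counts = dict(zip(ascii_uppercase, bins[65:91].tolist()))
nan_count = int(bins[ord("n")])
print(letter_counts["A"], letter_counts["B"], letter_counts["Z"], nan_count)
```

Because 'n' is lowercase, its bin never overlaps the uppercase range, which is why no fillna is needed in this setting. The trick would not carry over to data that contains a genuine lowercase 'n'.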

Timings on pandas 0.20.3 and NumPy 1.13.3:

In [3]: # Setup input
   ...: random.seed(100)
   ...: n = 1000000
   ...: data = {letter: [random.choice(list(ascii_uppercase) + 
   ...:         [np.nan]) for _ in range(n)] for letter in ascii_uppercase}
   ...: df = pd.DataFrame(data)
   ...: 

# @Wen's soln
In [4]: %timeit df.melt().value.value_counts()
1 loop, best of 3: 2.5 s per loop

# @andrew_reece's soln
In [5]: %timeit df.apply(pd.value_counts).sum(axis=1)
1 loop, best of 3: 2.14 s per loop

# Super-charged one
In [6]: %timeit app1S(df)
1 loop, best of 3: 501 ms per loop

Generic case

We can also use np.unique to cover the generic case (data with values longer than a single character):

unq, count = np.unique(df.fillna(-999), return_counts=1)
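
A small sketch of the np.unique route on multi-character values follows. The frame df_small and the "<NA>" sentinel are hypothetical. One caveat: on Python 3, np.unique has to sort the values, and an object array mixing strings with the integer -999 cannot be ordered, so a string sentinel is assumed here instead.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with multi-character values and one missing entry
df_small = pd.DataFrame({"x": ["AB", "C", np.nan], "y": ["C", "AB", "AB"]})

# Fill with a string sentinel so all items are comparable, then flatten
flat = df_small.fillna("<NA>").to_numpy().ravel().astype(str)

# Sorted distinct values and how often each occurs
unq, count = np.unique(flat, return_counts=True)
counts = dict(zip(unq.tolist(), count.tolist()))
print(counts)
```

The sentinel's count can be dropped from the dictionary afterwards if only real values are of interest.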

Answer 1 (score: 5)

df.apply(pd.value_counts).sum(axis=1)

Benchmark:

# example data
N = 10000000
rownum = int(N/1000.)
colnum = int(N/10000.)

str_vals = ['A','B','C','D']
str_data = np.random.choice(str_vals, size=N).reshape(rownum, colnum)
str_df = pd.DataFrame(str_data)

num_vals = [1,2,3,4]
num_data = np.random.choice(num_vals, size=N).reshape(rownum, colnum)
num_df = pd.DataFrame(num_data)

num_df.shape 
# (10000, 1000)

%%timeit
num_df.apply(pd.value_counts).sum(axis=1)
# 1 loop, best of 3: 883 ms per loop

%%timeit
str_df.apply(pd.value_counts).sum(axis=1)
# 1 loop, best of 3: 2.76 s per loop
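
To make the mechanics of the one-liner visible, here is a minimal sketch on a made-up three-row frame. It assumes pd.Series.value_counts as the applied function, since the top-level pd.value_counts is deprecated in newer pandas releases; df_small, per_column and totals are illustrative names.

```python
import pandas as pd

# Hypothetical small frame
df_small = pd.DataFrame({"x": ["A", "B", "A"], "y": ["B", "C", "B"]})

# Count values per column; rows are the union of all distinct values,
# with NaN where a value never occurs in that column
per_column = df_small.apply(pd.Series.value_counts)

# Sum across columns; NaN entries are skipped by default
totals = per_column.sum(axis=1).astype(int)
print(totals.to_dict())
```

The intermediate per_column frame has one row per distinct value and one column per original column, so its size grows with the number of distinct values rather than the number of rows.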

Answer 2 (score: 4)

melt, then value_counts (PS: this still cannot compete with the NumPy solution)

df.melt().value.value_counts()

Timings

%timeit df.melt().value.value_counts()
100 loops, best of 3: 1.43 ms per loop
%timeit c = Counter([v for c in df for v in df[c].fillna(-999)])
100 loops, best of 3: 5.23 ms per loop
%timeit df.apply(pd.value_counts).sum()
100 loops, best of 3: 18.5 ms per loop