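Pandas: finding the most frequent value in each row

Time: 2017-10-03 06:05:35

Tags: python pandas

I have a dataset containing binary values. I want to find the most frequent value in each row. The dataset has millions of records. What is the most efficient way to do this? Below is a sample of the dataset.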

import pandas as pd
data = pd.read_csv('myData.csv', sep = ',')
data.head()
bit1    bit2    bit2    bit4    bit5    frequent    freq_count
0       0       0       1       1       0           3
1       1       1       0       0       1           3
1       0       1       1       1       1           4

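I want to create the frequent and freq_count columns as shown in the example above. They are not part of the original dataset and will be created after looking at all the rows.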

2 answers:

Answer 0 (score: 2)

You can use scipy.stats.mode:

from scipy import stats

a = df.values.T
b = stats.mode(a)
print(b)
ModeResult(mode=array([[0, 1, 1]], dtype=int64), count=array([[3, 3, 4]]))

df['frequent'] = b[0][0]
df['freq_count'] = b[1][0]
print (df)
   bit1  bit2  bit2.1  bit4  bit5  frequent  freq_count
0     0     0       0     1     1         0           3
1     1     1       1     0     0         1           3
2     1     0       1     1     1         1           4

Using Counter.most_common:

from collections import Counter

def f(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])

df[['frequent','freq_count']] = df.apply(f, axis=1)
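To see what the helper unpacks, here is a minimal sketch of `Counter.most_common(1)` applied to a single row from the sample data. The names `row`, `top`, `value` and `count` are illustrative and do not come from the original answer.

```python
from collections import Counter

# One row from the sample data: bit1..bit5 = 1, 1, 1, 0, 0
row = [1, 1, 1, 0, 0]

# most_common(1) returns a one-element list of (value, count) pairs
top = Counter(row).most_common(1)
value, count = top[0]

print(top)           # [(1, 3)]
print(value, count)  # 1 3
```

On a tie, `most_common` orders equal counts by first occurrence (Python 3.7+), so the value that appears first in the row is the one reported.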

Another solution:

import numpy as np

def f(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a,b])

df[['frequent','freq_count']] = df.apply(f, axis=1)
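As a worked example of the `np.bincount` approach, the sketch below runs the same three calls on one row of the sample data. The array `row` is illustrative. `np.bincount` only accepts non-negative integers, which the 0/1 data here satisfies.

```python
import numpy as np

# One row from the sample data: bit1..bit5 = 0, 0, 0, 1, 1
row = np.array([0, 0, 0, 1, 1])

counts = np.bincount(row)     # counts[i] is the number of occurrences of i
frequent = np.argmax(counts)  # index of the largest count, i.e. the mode
freq_count = np.max(counts)   # how often the mode occurs

print(counts)                 # [3 2]
print(frequent, freq_count)   # 0 3
```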

Alternative:

from collections import defaultdict

def f(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])


df[['frequent','freq_count']] = df.apply(f, axis=1)

Timings:

np.random.seed(100)
N = 10000
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.randint(2, size=(N,20)))

In [140]: %timeit df.apply(f1, axis=1)
1 loop, best of 3: 1.78 s per loop

In [141]: %timeit df.apply(f2, axis=1)
1 loop, best of 3: 1.66 s per loop

In [142]: %timeit df.apply(f3, axis=1)
1 loop, best of 3: 1.7 s per loop

In [143]: %timeit mod(df)
100 loops, best of 3: 8.37 ms per loop

In [144]: %timeit mod1(df)
100 loops, best of 3: 8.88 ms per loop

Function definitions used in the timings above:

import numpy as np
import pandas as pd
from collections import Counter
from collections import defaultdict
from scipy import stats

def f1(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])

def f2(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a,b])

def f3(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])

def mod(df):
    a = df.values.T
    b = stats.mode(a)

    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df

def mod1(df):
    a = df.values
    b = stats.mode(a, axis=1)

    df['a'] = b[0][:, 0]
    df['b'] = b[1][:, 0]
    return df

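Answer 1 (score: 2)

Here's one approach -

import numpy as np

def freq_stat(df):
    a = df.values
    zero_c = (a==0).sum(1)        # number of zeros in each row
    one_c = a.shape[1] - zero_c   # number of ones (assumes only 0/1 values)
    df['frequent'] = (zero_c<=one_c).astype(int)
    df['freq_count'] = np.maximum(zero_c, one_c)
    return df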

Sample run -

In [305]: df
Out[305]: 
   bit1  bit2  bit2.1  bit4  bit5
0     0     0       0     1     1
1     1     1       1     0     0
2     1     0       1     1     1

In [308]: freq_stat(df)
Out[308]: 
   bit1  bit2  bit2.1  bit4  bit5  frequent  freq_count
0     0     0       0     1     1         0           3
1     1     1       1     0     0         1           3
2     1     0       1     1     1         1           4
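The counting logic in `freq_stat` can be traced on a tiny array. The sketch below uses a made-up two-row input with an even number of columns (not from the original post) to show what happens on a tie: because of the `<=` comparison, a tie is reported as 1, whereas `np.bincount` followed by `np.argmax` returns the first maximum and so reports 0.

```python
import numpy as np

# Hypothetical input with an even column count, so a tie is possible
a = np.array([[0, 1, 0, 1],   # tie: two 0s and two 1s
              [0, 0, 0, 1]])  # clear majority of 0s

zero_c = (a == 0).sum(1)      # zeros per row
one_c = a.shape[1] - zero_c   # ones per row (valid only for 0/1 data)
frequent = (zero_c <= one_c).astype(int)
freq_count = np.maximum(zero_c, one_c)

print(frequent)    # [1 0]
print(freq_count)  # [2 3]

# For comparison: bincount + argmax resolves the tie in row 0 to 0
tie_via_bincount = np.argmax(np.bincount(a[0]))
print(tie_via_bincount)  # 0
```

With an odd number of columns, as in the five-bit sample, ties cannot occur and the approaches agree.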

Benchmarking

Let's test this against the fastest approach from @jezrael's soln:

from scipy import stats

def mod(df): # @jezrael's best soln 
    a = df.values.T
    b = stats.mode(a)

    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df

Also, let's use the same setup from the other post and get the timings -

In [323]: np.random.seed(100)
     ...: N = 10000
     ...: #[10000 rows x 20 columns]
     ...: df = pd.DataFrame(np.random.randint(2, size=(N,20)))
     ...: 

# @jezrael's soln 
In [324]: %timeit mod(df)
100 loops, best of 3: 5.92 ms per loop

# Proposed in this post
In [325]: %timeit freq_stat(df)
1000 loops, best of 3: 496 µs per loop
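The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard `timeit` module. Both benchmarked functions assign new columns to the DataFrame they receive, so repeated calls on the same `df` would also count the columns added by the previous call. The sketch therefore passes a fresh copy each time, which means the copy cost is included in the measurement. The repeat count `n_runs` is an arbitrary illustrative choice, and no timing figures are shown because they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

def freq_stat(df):
    # Same logic as the answer above, repeated so this block is self-contained
    a = df.values
    zero_c = (a == 0).sum(1)
    one_c = a.shape[1] - zero_c
    df['frequent'] = (zero_c <= one_c).astype(int)
    df['freq_count'] = np.maximum(zero_c, one_c)
    return df

np.random.seed(100)
N = 10000
# [10000 rows x 20 columns], as in the setup above
df = pd.DataFrame(np.random.randint(2, size=(N, 20)))

# Time on a copy so the two added columns never feed back into the input
n_runs = 20  # illustrative repeat count
total = timeit.timeit(lambda: freq_stat(df.copy()), number=n_runs)
print(f"{total / n_runs * 1e3:.3f} ms per call (including the copy)")
```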