I am trying to count duplicate rows in a pandas DataFrame. I start from a DataFrame that looks like this:
feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False
The desired output for the sample input above is:
feature, IV, IT, count
early/J_result/N, True, False, 3
excellent/J_result/N, True, True, 1
hillsdown/N, True, False, 2
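For reference, this transformation can be reproduced with a single groupby over all columns. A minimal sketch, assuming modern pandas and the column names from the example (the inline CSV string is just a stand-in for the real file):

```python
import io
import pandas as pd

# Example data from the question; skipinitialspace handles the ", " separator
# without falling back to the slower python parsing engine.
csv = """feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False
"""
df = pd.read_csv(io.StringIO(csv), skipinitialspace=True)

# Count identical rows: group by every column and take the group sizes.
out = (df.groupby(list(df.columns))
         .size()
         .reset_index(name='count'))
print(out)
```

This yields one row per distinct (feature, IV, IT) combination with its count, matching the desired output above.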
My current code is:
import pandas as pd

def sum_up_token_counts(hdf_file):
    df = pd.read_csv(hdf_file, sep=', ')
    counts = df.groupby('feature').count().feature
    assert counts.sum() == df.shape[0]  # no missing rows
    df = df.drop_duplicates()
    df.set_index('feature', inplace=True)
    df['count'] = counts
    return df
This works as expected, but it takes a long time. I profiled it, and it looks like almost all the time is spent in the groupby and count:
Total time: 4.43439 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
    28                                           def sum_up_token_counts(hdf_file):
29 1 57567 57567.0 1.3 df = pd.read_csv(hdf_file, sep=', ')
30 1 4368529 4368529.0 98.5 counts = df.groupby('feature').count().feature
31 1 174 174.0 0.0 assert counts.sum() == df.shape[0] # no missing rows
32 1 6234 6234.0 0.1 df = df.drop_duplicates()
33 1 501 501.0 0.0 df.set_index('feature', inplace=True)
34 1 1377 1377.0 0.0 df['count'] = counts
35 1 1 1.0 0.0 return df
Any ideas how to speed this code up?
Answer 0 (score: 2)
count is greatly sped up in master/0.14 (coming soon), see here.
Here's a benchmark of master/0.14 vs 0.13.1:
Setup
In [1]: n = 10000
In [2]: offsets = np.random.randint(n, size=n).astype('timedelta64[ns]')
In [3]: dates = np.datetime64('now') + offsets
In [4]: dates[np.random.rand(n) > 0.5] = np.datetime64('nat')
In [5]: offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat')
In [6]: value2 = np.random.randn(n)
In [7]: value2[np.random.rand(n) > 0.5] = np.nan
In [8]: obj = pd.util.testing.choice(['a', 'b'], size=n).astype(object)
In [9]: obj[np.random.rand(n) > 0.5] = np.nan
In [10]: df = DataFrame({'key1': np.random.randint(0, 500, size=n),
....: 'key2': np.random.randint(0, 100, size=n),
....: 'dates': dates,
....: 'value2' : value2,
....: 'value3' : np.random.randn(n),
....: 'obj': obj,
....: 'offsets': offsets})
v0.13.1
In [11]: %timeit df.groupby(['key1', 'key2']).count()
1 loops, best of 3: 5.41 s per loop
v0.14.0
In [11]: %timeit df.groupby(['key1', 'key2']).count()
100 loops, best of 3: 6.25 ms per loop
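If upgrading is not an option, note that most of the cost of .count() comes from counting non-null values in every column per group; when you only need row counts per group, .size() is much cheaper and available in older versions too. A hedged sketch on synthetic data similar to the benchmark above (column names are illustrative):

```python
import numpy as np
import pandas as pd

n = 10000
df = pd.DataFrame({'key1': np.random.randint(0, 500, size=n),
                   'key2': np.random.randint(0, 100, size=n),
                   'value': np.random.randn(n)})

# .size() returns one row count per group and ignores column contents,
# so it skips the per-column null-checking that .count() performs.
sizes = df.groupby(['key1', 'key2']).size()

# With no nulls present, .size() and .count() agree for any column.
counts = df.groupby(['key1', 'key2']).count()
assert (sizes == counts['value']).all()
```

The two differ only when a column contains nulls, in which case .count() reports the smaller non-null tally for that column.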