I am looking for a way to remove rows from a DataFrame that contain low-frequency items. I adapted the following snippet from this post:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100,2)),
columns = ['A', 'B'])
threshold = 10 # Anything that occurs this many times or fewer will be removed.
value_counts = df.stack().value_counts() # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
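The problem is that this code does not seem to scale.
The line
to_remove = value_counts[value_counts <= threshold].index
has now been running for several hours on my data (a 2 GB compressed HDFStore), so I need a better solution, ideally an out-of-core one. I suspect dask.dataframe is a good fit, but I have not managed to express the code above in terms of dask: the key functions stack and replace are missing from dask.dataframe.
I tried the following (which works in plain pandas) to work around the lack of these two functions: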
value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3] for value_counts in value_countss]
rows_to_drop = set(i for indices in [df.loc[df[col].isin(infrequent_items.keys())].index.values for col, infrequent_items in zip(df.columns, infrequent_itemss)] for i in indices)
df.drop(rows_to_drop)
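However, this does not actually work with dask. It fails at infrequent_items.keys().
Even if it did work, it is the opposite of elegant, so I suspect there must be a better way.
Can you suggest something?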
Answer 0 (score: 1)
Not sure if this will help you, but it is too big for a comment:
df = pd.DataFrame(np.random.randint(0, high=20, size=(30,2)), columns = ['A', 'B'])
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
threshold = 10
to_remove = [k for k, v in d.items() if v < threshold]
df.replace(to_remove, np.nan, inplace=True)
See:
How to count the occurrence of certain item in an ndarray in Python?
how to count occurrence of each unique value in pandas
On a toy problem, this gave a 40x speedup in the step you mentioned, from 400 us down to 10 us.
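As a sanity check on the swap above, here is a minimal sketch showing that counting with np.unique on the flattened array selects the same rare values as stack().value_counts(). The seed, the frame size and the names rare_pandas and rare_numpy are assumptions made for illustration only; it also uses a boolean mask on the counts instead of building a dict.

```python
import numpy as np
import pandas as pd

# Illustrative data; the seed and shape are arbitrary choices.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(30, 2)), columns=["A", "B"])
threshold = 10

# Approach from the question: stack the frame, then count.
vc = df.stack().value_counts()
rare_pandas = set(vc[vc < threshold].index.tolist())

# Approach from this answer: count on the flattened NumPy array.
unique, counts = np.unique(df.values.ravel(), return_counts=True)
rare_numpy = set(unique[counts < threshold].tolist())

# Both routes should pick out exactly the same values.
same = rare_pandas == rare_numpy
print(same)
```

Indexing unique with the boolean mask counts < threshold avoids the intermediate dict and gives the same set of values.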
Answer 1 (score: 1)
The code below, which incorporates Evan's improvement, solved my problem:
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
to_remove = {k for k, v in d.items() if v < threshold}
mask = df.isin(to_remove)
column_mask = (~mask).all(axis=1)
df = df[column_mask]
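Note that the replace-based snippets earlier in the thread only turn rare values into NaN, whereas the code above drops the affected rows. The sketch below, on a small hand-made frame (the data and the names via_mask and via_replace are assumptions for illustration), suggests the two routes keep the same rows once dropna() is applied after replace.

```python
import numpy as np
import pandas as pd

# Small hand-made frame: 1 occurs 6 times, 2 occurs 3 times, 5 occurs once.
df = pd.DataFrame({"A": [1, 1, 2, 1, 5], "B": [1, 2, 2, 1, 1]})
threshold = 2

unique, counts = np.unique(df.values.ravel(), return_counts=True)
to_remove = set(unique[counts < threshold].tolist())  # only the value 5

# Route 1: boolean mask, keep rows in which no cell is a rare value.
via_mask = df[(~df.isin(to_remove)).all(axis=1)]

# Route 2: mark rare values as NaN, then drop rows containing NaN.
via_replace = df.replace(list(to_remove), np.nan).dropna()

print(list(via_mask.index))
print(list(via_replace.index))
```

The mask route keeps the original integer dtype, while the replace route converts the columns to float because of the intermediate NaN values.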
Demo:
def filter_low_frequency(df, threshold=4):
    unique, counts = np.unique(df.values.ravel(), return_counts=True)
    d = dict(zip(unique, counts))
    to_remove = {k for k, v in d.items() if v < threshold}
    mask = df.isin(to_remove)
    column_mask = (~mask).all(axis=1)
    df = df[column_mask]
    return df
df = pd.DataFrame(np.random.randint(0, high=20, size=(10,10)))
print(df)
print(df.stack().value_counts())
df = filter_low_frequency(df)
print(df)
0 1 2 3 4 5 6 7 8 9
0 3 17 11 13 8 8 15 14 7 8
1 2 14 11 3 16 10 19 19 14 4
2 8 13 13 17 3 13 17 18 5 18
3 7 8 14 9 15 12 0 15 2 19
4 6 12 13 11 16 6 19 16 2 17
5 2 1 2 17 1 3 12 10 2 16
6 0 19 9 4 15 3 3 3 4 0
7 18 8 15 9 1 18 15 17 9 0
8 17 15 9 11 13 9 11 4 19 8
9 13 6 7 8 8 10 0 3 16 13
8 9
3 8
13 8
17 7
15 7
19 6
2 6
9 6
11 5
16 5
0 5
18 4
4 4
14 4
10 3
12 3
7 3
6 3
1 3
5 1
dtype: int64
0 1 2 3 4 5 6 7 8 9
6 0 19 9 4 15 3 3 3 4 0
8 17 15 9 11 13 9 11 4 19 8
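The question also asked for something out-of-core. Nobody in the thread posted one, so the following is only a rough two-pass sketch in plain pandas, under the assumption that the data can be read as a sequence of DataFrame chunks. The helpers count_values, filter_chunks and make_chunks are hypothetical names, and make_chunks merely slices a small in-memory frame to stand in for a real chunked reader (with a table-format HDFStore, something like store.select(key, chunksize=...) might play that role).

```python
from collections import Counter

import numpy as np
import pandas as pd


def count_values(chunks):
    """First pass: accumulate value counts over an iterable of DataFrames."""
    totals = Counter()
    for chunk in chunks:
        unique, counts = np.unique(chunk.values.ravel(), return_counts=True)
        # Counter.update with a mapping adds the counts to the running totals.
        totals.update(dict(zip(unique.tolist(), counts.tolist())))
    return totals


def filter_chunks(chunks, to_remove):
    """Second pass: yield each chunk without rows that contain a rare value."""
    for chunk in chunks:
        keep = (~chunk.isin(to_remove)).all(axis=1)
        yield chunk[keep]


# Stand-in for a chunked reader: slices of a small in-memory frame.
df = pd.DataFrame({"A": [1, 1, 1, 2, 3, 1], "B": [1, 2, 1, 1, 1, 2]})


def make_chunks(size=2):
    return (df.iloc[i:i + size] for i in range(0, len(df), size))


threshold = 2
totals = count_values(make_chunks())
to_remove = {k for k, v in totals.items() if v < threshold}
result = pd.concat(filter_chunks(make_chunks(), to_remove))
print(result)
```

Only the counter and one chunk are held in memory at a time; in a real setting the filtered chunks would be written back to disk instead of being concatenated.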