Question

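I want to perform a splitting task, but each class needs a minimum number of samples, so I would like to filter the DataFrame by the column that identifies the class label. If a class occurs less often than some threshold, it should be filtered out.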

>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
2  0  0  6

>>> filter_on_col(df, col=2, threshold=2)  # Removes first row (3 occurs only once)
   0  1  2
0  4  5  6
1  0  0  6

I can do something like df[2].value_counts() to get the frequency of each value in column 2, and then find out which values meet the threshold with:

>>> df[2].value_counts() >= 2
6     True
3    False

From there, the logic for the rest is fairly simple.
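For reference, those remaining steps might look like the following minimal sketch. The name `min_count` and its value of 2 are assumptions made for this example, not part of any pandas API.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])

# Count how often each value occurs in column 2.
counts = df[2].value_counts()

# Keep only the values that occur at least `min_count` times.
min_count = 2  # assumed threshold for this example
frequent_values = counts[counts >= min_count].index

# Select the rows whose column-2 value is one of the frequent values.
filtered = df[df[2].isin(frequent_values)]
print(filtered)
```

Note that the surviving rows keep their original index labels; call `reset_index(drop=True)` if a fresh 0-based index is wanted.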

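But I feel there is an elegant Pandas one-liner I could use here, or perhaps a more efficient approach.

My question is very similar to "Select rows from a DataFrame based on values in a column in pandas", but the tricky part is that I depend on the frequency of the values rather than on the values themselves.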

Answer 1

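So here is a one-liner:

# Assuming the parameters of your specific example posed above.
col = 2; thresh = 2

# .ge(thresh) marks values that occur at least `thresh` times;
# .loc[lambda x: x] keeps only the True entries, whose index holds those values.
df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x: x].index)]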

Out[303]: 
   0  1  2
1  4  5  6
2  0  0  6

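Or another one-liner:

# transform('count') gives every row the size of its group, aligned to df's index.
df[df.groupby(col)[col].transform('count') >= thresh]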
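If you want this as the `filter_on_col` helper from the question, one possible sketch is below. The function body is my own assumption about how such a helper could be written (it maps each row's value to its frequency); it also checks that the result matches the `transform`-based one-liner on the example data.

```python
import pandas as pd

def filter_on_col(df, col, threshold):
    """Keep rows whose value in `col` occurs at least `threshold` times."""
    # Map every row's value to its frequency, then compare with the threshold.
    freq = df[col].map(df[col].value_counts())
    return df[freq >= threshold]

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
result = filter_on_col(df, col=2, threshold=2)

# The transform-based one-liner should select the same rows.
alt = df[df.groupby(2)[2].transform('count') >= 2]
assert result.equals(alt)
print(result)
```

Both variants compute a per-row frequency aligned with the original index, so the boolean mask can be applied to `df` directly.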
