I have a pandas DataFrame that looks like this:
r vals positions
1.2 1
1.8 2
2.3 1
1.8 1
2.1 3
2.0 3
1.9 1
... ...
I want to filter out every row whose positions value does not appear at least 20 times. I have seen something like this:
g=df.groupby('positions')
g.filter(lambda x: len(x) > 20)
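But this does not seem to work, and I do not understand how to get the original DataFrame back from it. Thanks in advance for your help.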
Answer 0 (score: 28)
On your limited dataset, the following works:
In [125]:
df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
Out[125]:
0 1.2
2 2.3
3 1.8
6 1.9
Name: r vals, dtype: float64
You can assign the result of this filter and use it with isin to filter your original data:
In [129]:
filtered = df.groupby('positions')['r vals'].filter(lambda x: len(x) >= 3)
df[df['r vals'].isin(filtered)]
Out[129]:
r vals positions
0 1.2 1
1 1.8 2
2 2.3 1
3 1.8 1
6 1.9 1
In your case you just need to change 3 to 20.
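Note that the isin output above still contains row 1 (positions 2): isin matches on the values of r vals, and 1.8 also occurs in a row with positions 1. A minimal sketch of selecting by index label instead, assuming the sample data from the question (the names min_count and by_index are only illustrative):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

min_count = 3  # illustrative threshold; use 20 on the real data

# Filtering the 'r vals' Series per group keeps the original index labels.
filtered = df.groupby("positions")["r vals"].filter(lambda x: len(x) >= min_count)

# Select rows by index label rather than by matching values.
by_index = df.loc[filtered.index]
print(by_index)
```

Selecting by label avoids accidental matches when the same r vals value appears in more than one group.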
Another approach is to use value_counts to build an aggregated Series, which we can then use to filter your df:
In [136]:
counts = df['positions'].value_counts()
counts
Out[136]:
1 4
3 2
2 1
dtype: int64
In [137]:
counts[counts > 3]
Out[137]:
1 4
dtype: int64
In [135]:
df[df['positions'].isin(counts[counts > 3].index)]
Out[135]:
r vals positions
0 1.2 1
2 2.3 1
3 1.8 1
6 1.9 1
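A related variant, not part of the original answer, is to attach each row's group size with transform and compare it directly. This is only a sketch on the sample data from the question; group_size, min_count, and result are illustrative names:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

min_count = 3  # illustrative threshold; use 20 on the real data

# For every row, the number of rows sharing its 'positions' value.
group_size = df.groupby("positions")["positions"].transform("size")

# Boolean mask aligned with df's index, so it can index df directly.
result = df[group_size >= min_count]
print(result)
```

Because the mask is aligned with the original index, no isin lookup is needed.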
EDIT
If you want to filter the groupby object on the DataFrame rather than on a Series, you can call filter directly on the groupby object:
In [139]:
filtered = df.groupby('positions').filter(lambda x: len(x) >= 3)
filtered
Out[139]:
r vals positions
0 1.2 1
2 2.3 1
3 1.8 1
6 1.9 1
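One likely reason the snippet in the question appeared not to work is that filter returns a new DataFrame and leaves df untouched, so the result has to be assigned. Also, len(x) > 20 drops groups with exactly 20 rows, while "at least 20" needs >=. A small sketch on the sample data, with 3 standing in for 20 (kept and dropped are illustrative names):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "r vals": [1.2, 1.8, 2.3, 1.8, 2.1, 2.0, 1.9],
    "positions": [1, 2, 1, 1, 3, 3, 1],
})

g = df.groupby("positions")

# filter does not modify df; it returns the rows of the groups that pass.
kept = g.filter(lambda x: len(x) >= 3)

# The original frame is unchanged and still has all 7 rows.
print(len(df), len(kept))

# The rows that were filtered out can be recovered from the original frame.
dropped = df.drop(kept.index)
print(dropped)
```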
Answer 1 (score: 1)
I like the following approach:
import pandas as pd

def filter_by_freq(df: pd.DataFrame, column: str, min_freq: int) -> pd.DataFrame:
    """Filters the DataFrame based on the value frequency in the specified column.

    :param df: DataFrame to be filtered.
    :param column: Column name that should be frequency filtered.
    :param min_freq: Minimal value frequency for the row to be accepted.
    :return: Frequency filtered DataFrame.
    """
    # Frequencies of each value in the column.
    freq = df[column].value_counts()
    # Select frequent values. Value is in the index.
    frequent_values = freq[freq >= min_freq].index
    # Return only rows whose value frequency is at or above the threshold.
    return df[df[column].isin(frequent_values)]
It is much faster than the filter lambda approach in the accepted answer, because the Python-level overhead is kept to a minimum.
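The speed claim can be checked with a rough benchmark like the one below. This is only a sketch: the synthetic data, its size, and the helper names with_filter and with_value_counts are assumptions, and timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption): 20,000 rows spread over about 1,000 positions.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "r vals": rng.random(20_000),
    "positions": rng.integers(0, 1_000, size=20_000),
})

def with_filter(df, column, min_freq):
    # Calls the Python lambda once per group.
    return df.groupby(column).filter(lambda g: len(g) >= min_freq)

def with_value_counts(df, column, min_freq):
    # Counts every value once, then does a single vectorized membership test.
    freq = df[column].value_counts()
    return df[df[column].isin(freq[freq >= min_freq].index)]

# Both approaches should select exactly the same rows.
same = with_filter(big, "positions", 20).equals(with_value_counts(big, "positions", 20))
print("same result:", same)

t_filter = timeit.timeit(lambda: with_filter(big, "positions", 20), number=3)
t_counts = timeit.timeit(lambda: with_value_counts(big, "positions", 20), number=3)
print(f"filter: {t_filter:.3f}s  value_counts: {t_counts:.3f}s")
```

The gap tends to grow with the number of distinct groups, since the lambda runs once per group.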
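Answer 2 (score: 0)
How to select all rows whose positions value is >= 20:
# Note: this selects on the value stored in 'positions', not on how often it occurs.
mask = df['positions'] >= 20
# .ix was removed in pandas 1.0; .loc accepts the boolean mask.
sel = df.loc[mask, :]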
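Answer 3 (score: 0)
counts = df.positions.value_counts(dropna=False)
# Keep positions whose count is between 20 and the maximum count (inclusive);
# .index holds the position values, which is what isin must compare against.
df = df[df.positions.isin(counts[counts.isin(list(range(20, counts.max() + 1)))].index)]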
This solution is preferable because of its computation time, compared with the answer that filters on len:
CPU times: user 2.1 ms, sys: 485 µs, total: 2.58 ms Wall time: 20.3 ms
VS
CPU times: user 15.2 ms, sys: 11.7 ms, total: 26.9 ms Wall time: 156 m