Question

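Suppose I have a pandas dataset with a column of factors, 'A' through 'Z', where factors 'A', 'B', and 'C' each have 30 observations and the remaining factors have only 5 each. There are other columns in this data frame, but I only care about this one series of factors (let's call it factor1).

What pandas operation do I use to filter this data frame so that the only rows left are those whose factor has more than 20 observations? And what operation would I use if I wanted only the three most popular factors of factor1?

Edit: here is a small example to work with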

import pandas

data = {'factor1':['A','A','A', 'B', 'B', 'B', 'C','C', 'D'], 'factor2':['apple','apple','apple','apple','apple','apple','orange','orange','orange'], 'response':range(9)}
df = pandas.DataFrame(data)

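How do I filter df so that it keeps only the 3 most popular values of factor1, or only the factors with a frequency greater than 5 (or n, or any other threshold)?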

Answer 1

Try this for the top 3 most popular factors:

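N = 3
# Count the rows for each factor1 value
handy = df.groupby('factor1')['factor1'].count()
# Sort the counts in descending order (sort_values returns a new Series)
handy = handy.sort_values(ascending=False)
topNFactors = handy.head(N)
print(topNFactors)

# Keep only the rows whose factor1 is one of the top N factors
dataOfTopNFactors = df[df['factor1'].isin(topNFactors.index)]
print(dataOfTopNFactors)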
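
As a possible alternative to the groupby-based top-N filter above, here is a minimal sketch that uses `value_counts`, which returns the counts already sorted in descending order. The names `top_factors` and `top_rows` are illustrative and not part of the original answer; the data is the sample from the question.

```python
import pandas as pd

# Sample data from the question
data = {
    'factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'factor2': ['apple'] * 6 + ['orange'] * 3,
    'response': range(9),
}
df = pd.DataFrame(data)

N = 3
# value_counts() sorts counts in descending order by default,
# so the first N index labels are the N most frequent factors.
top_factors = df['factor1'].value_counts().head(N).index

# Keep only the rows whose factor1 is among those labels
top_rows = df[df['factor1'].isin(top_factors)]
print(top_rows)
```

If several factors tie at the cutoff, `head(N)` keeps only N of them, so which tied factor survives should not be relied on.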

Or try this for factors with a frequency of at least 2:

M = 2
# Count the rows for each factor1 value
handy = df.groupby('factor1')['factor1'].count()
# Keep the factors that occur at least M times
minimumValueMFactors = handy[handy >= M]
dataOfMinimumValueMFactors = df[df['factor1'].isin(minimumValueMFactors.index)]
print(dataOfMinimumValueMFactors)
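
The same minimum-frequency filter can also be sketched without building a separate lookup of factor names, by broadcasting each group's size back onto its rows with `transform('size')`. This is one possible variant, not the original answer's method; `counts` and `frequent_rows` are illustrative names.

```python
import pandas as pd

# Sample data from the question
data = {
    'factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'D'],
    'factor2': ['apple'] * 6 + ['orange'] * 3,
    'response': range(9),
}
df = pd.DataFrame(data)

M = 2
# For every row, compute how many rows share its factor1 value.
# The result is aligned with df's index, so it can be used as a mask.
counts = df.groupby('factor1')['factor1'].transform('size')

# Keep rows whose factor occurs at least M times
frequent_rows = df[counts >= M]
print(frequent_rows)
```

Because `counts` has one entry per row of `df`, the comparison `counts >= M` can be combined with other row-level conditions using `&` or `|`.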
