Group By, Agg, Reset Index where the filter returns null

Date: 2017-04-12 22:54:18

Tags: python pandas

Here is a sample dataset:

   customer_number ethnicity fiscal_quarter  fiscal_year
1              231     Black      Quarter 1         2016
2              451     White      Quarter 1         2016
3              345     White      Quarter 1         2016

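I want to filter the ethnicity column for 'Asian', group by fiscal_year and fiscal_quarter, and count the unique customer_number values. If there are no 'Asian' rows, the result should still be a dataframe like the following, with a count of 0: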

   customer_number fiscal_quarter  fiscal_year
1                0      Quarter 1         2016
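
For reference, here is a minimal sketch of why the straightforward approach comes back empty. It rebuilds the three sample rows above; the variable name `naive` is only illustrative and is not part of the original question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'customer_number': [231, 451, 345],
    'ethnicity': ['Black', 'White', 'White'],
    'fiscal_quarter': ['Quarter 1', 'Quarter 1', 'Quarter 1'],
    'fiscal_year': [2016, 2016, 2016],
})

# Naive approach: filter first, then group and count unique customers
naive = (
    df[df['ethnicity'] == 'Asian']
    .groupby(['fiscal_year', 'fiscal_quarter'])['customer_number']
    .nunique()
    .reset_index()
)

# No 'Asian' rows survive the filter, so there are no groups left to report
print(naive.empty)
```

Because the filter runs before the grouping, the (2016, Quarter 1) group is lost before it can be counted, which is why no row with 0 appears.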

2 answers:

Answer 0: (score: 1)

Short answer

import numpy as np
import pandas as pd

# make the column `Categorical` and include `'Asian'` as one of the categories
e = df.ethnicity
df['ethnicity'] = pd.Categorical(e, categories=np.append('Asian', e.unique()))

# simple function to be applied; performs the 2nd level of `groupby`
# observed=False keeps categories that have no rows in the group
def f(df):
    s = df.groupby('ethnicity', observed=False).customer_number.nunique()
    return s.loc['Asian']

# initial `groupby`
d = df.groupby(['fiscal_year', 'fiscal_quarter']).apply(f)

d.reset_index(name='nunique')

   fiscal_year fiscal_quarter  nunique
0         2016      Quarter 1        0

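Explanation

  • A convenient way to make groupby produce aggregated results for groups that do not exist in the data is to define the grouping column as 'Categorical', where the categories you define can include the missing values. pandas will then include those categories in the aggregated result.
  • However, in this case I could not groupby all 3 columns at once and keep that same convenience. I had to split the grouping in 2:
    1. groupby the columns that are not 'Categorical', namely ['fiscal_year', 'fiscal_quarter'].
    2. apply to the groups from step 1 a function that performs a simple groupby on ethnicity alone. This maintains the desired behavior and reports all categories, whether or not they are represented in the data.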
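
The first point can be seen in isolation with a small sketch. The frame `s` and the explicit category list below are illustrative assumptions, not part of the answer; `observed=False` is passed explicitly so that categories with no rows are kept.

```python
import pandas as pd

# 'Asian' is declared as a category even though no row uses it
s = pd.DataFrame({
    'ethnicity': pd.Categorical(
        ['Black', 'White', 'White'],
        categories=['Asian', 'Black', 'White'],
    ),
    'customer_number': [231, 451, 345],
})

# Grouping on the categorical column reports every declared category
counts = s.groupby('ethnicity', observed=False)['customer_number'].nunique()
print(counts)
```

The unused 'Asian' category still appears in `counts`, with a unique-customer count of 0.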

Keeping all categories

e = df.ethnicity
df['ethnicity'] = pd.Categorical(
    e, categories=np.append(['Asian', 'Hispanic'], e.unique()))

# observed=False keeps categories that have no rows in the group
def f(df):
    return df.groupby('ethnicity', observed=False).customer_number.nunique()

d = df.groupby(['fiscal_year', 'fiscal_quarter']).apply(f)

d.stack().reset_index(name='nunique')

   fiscal_year fiscal_quarter ethnicity  nunique
0         2016      Quarter 1     Asian        0
1         2016      Quarter 1  Hispanic        0
2         2016      Quarter 1     Black        1
3         2016      Quarter 1     White        1
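
As a different technique from the Categorical approach above (not the answerer's method), one could also blank out the non-matching customer numbers instead of filtering rows, so that every (fiscal_year, fiscal_quarter) group survives. This is only a sketch on the question's three sample rows; the names `masked` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_number': [231, 451, 345],
    'ethnicity': ['Black', 'White', 'White'],
    'fiscal_quarter': ['Quarter 1', 'Quarter 1', 'Quarter 1'],
    'fiscal_year': [2016, 2016, 2016],
})

# Keep customer_number only where ethnicity matches; other rows become NaN
masked = df['customer_number'].where(df['ethnicity'].eq('Asian'))

# nunique() ignores NaN by default, so groups without a match count as 0
result = (
    df.assign(customer_number=masked)
    .groupby(['fiscal_year', 'fiscal_quarter'])['customer_number']
    .nunique()
    .reset_index()
)
print(result)
```

The trade-off is that this reports one ethnicity at a time, whereas the Categorical version can report every category in a single pass.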

Answer 1: (score: 0)

If I understand what you are looking for, the following should do it:

import pandas as pd

# Generate data
d = {'customer_number': [231, 451, 345, 236, 457, 354],
     'ethnicity': ['Black', 'White', 'White', 'Black', 'White', 'White'],
     'fiscal_quarter': ['Quarter 1', 'Quarter 1', 'Quarter 1', 'Quarter 3', 'Quarter 3', 'Quarter 1'],
     'fiscal_year': [2016, 2016, 2016, 2015, 2015, 2017]}

df = pd.DataFrame(d)

# Helper function that returns one row per dataframe (or group) holding
# the number of unique customers that match the given ethnicity (0 if none)
def find_ethnicity(dff, ethnicity):
    count = dff.customer_number[dff.ethnicity.eq(ethnicity)].nunique()
    if count == 0:
        dff = dff.head(1).copy()
    else:
        dff = dff[dff.ethnicity.eq(ethnicity)].copy().head(1)
    dff['ethnicity'] = ethnicity
    dff['customer_number'] = count
    return dff


# Test with ethnicity 'Black' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'Black').reset_index(drop=True))

#    customer_number ethnicity fiscal_quarter  fiscal_year
# 0                1     Black      Quarter 3         2015
# 1                1     Black      Quarter 1         2016
# 2                0     Black      Quarter 1         2017

# Test with ethnicity 'Asian' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'Asian').reset_index(drop=True))

#    customer_number ethnicity fiscal_quarter  fiscal_year
# 0                0     Asian      Quarter 3         2015
# 1                0     Asian      Quarter 1         2016
# 2                0     Asian      Quarter 1         2017

# Test with ethnicity 'White' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'White').reset_index(drop=True))

#    customer_number ethnicity fiscal_quarter  fiscal_year
# 0                1     White      Quarter 3         2015
# 1                2     White      Quarter 1         2016
# 2                1     White      Quarter 1         2017

# Test with ethnicity 'Latino' grouping by fiscal_year and fiscal_quarter
print(df.groupby(['fiscal_year', 'fiscal_quarter'], as_index=False).apply(find_ethnicity, 'Latino').reset_index(drop=True))

#    customer_number ethnicity fiscal_quarter  fiscal_year
# 0                0    Latino      Quarter 3         2015
# 1                0    Latino      Quarter 1         2016
# 2                0    Latino      Quarter 1         2017

# Test with ethnicity 'Asian' without grouping
print(find_ethnicity(df, 'Asian'))

#    customer_number ethnicity fiscal_quarter  fiscal_year
# 0                0     Asian      Quarter 1         2016

I hope this proves useful.