Filling NaN values from another dataframe using group by

Time: 2019-10-01 06:52:08

Tags: python dataframe

I have 2 dataframes.

The first one looks like this:

Month DayOfWeek  Class A1  A2 ... A999
July  Monday     Bata  7   9  ... 5
July  Tuesday    Bata  3   1  ... 2
July  Sunday     Bata  4   5  ... 6
July  Monday     Adid  9   8  ... 5
July  Sunday     Adid  4   0  ... 4
Sept  Monday     Nike  7   5  ... 7
Sept  Sunday     Nike  8   3  ... 7
Sept  Satday     Adid  2   7  ... 7
Sept  Monday     Bata  8   9  ... 4
Oct   Monday     Nike  4   2  ... 5
Oct   Sunday     Bata  8   6  ... 3

My second dataframe looks like this:

Month DayOfWeek  Class A1  A2 ... A999
Jul   Monday     Bata  5   7      8
Oct   Monday     Adid  1   2      3
Sep   Monday     Bata  3   7      6
Sep   Monday     Nike  8   3      8
Jul   Monday     Adid  NaN NaN    NaN
Sep   Sunday     Nike  NaN NaN    NaN
Oct   Satday     Nike  NaN NaN    NaN
Sep   Monday     Bata  NaN NaN    NaN

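The first dataframe, called df1, has no NaN. In the second dataframe, df2, almost half of the values in columns A1 to A999 are NaN.

The number of columns is variable; it could be A1 to A10 or A1 to A2567.

It can be any number of columns.

I want to fill these NaN values in df2 with the mean of the same Month and DayOfWeek from df1.

I posted another question earlier, but the situation has changed: the data is now split into 2 dataframes and the number of columns is unknown.

So far I have done this: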

Mth = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
Wk = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for m in Mth:
    for w in Wk:
        print(w,m, df[(df["Month"]==m) & (df["DayOfWeek"]==w) ].mean())

I don't know where to go from here, or how to apply this to all columns without specifying the column names.

Month DayOfWeek  Class A1  A2 ... A999
Jul   Monday     Bata  5   7      8
Oct   Monday     Adid  1   2      3
Sep   Monday     Bata  3   7      6
Sep   Monday     Nike  8   3      8
Jul   Monday     Adid  NaN NaN    NaN  <--- Avg of Monday Jul in df1 for each column
Sep   Sunday     Nike  NaN NaN    NaN  <--- Avg of Sunday Sep in df1 for each column
Oct   Satday     Nike  NaN NaN    NaN  <--- Avg of Satday Oct in df1 for each column
Sep   Monday     Bata  NaN NaN    NaN  <--- Avg of Monday Sep in df1 for each column

How can I do this?

2 answers:

Answer 0: (score: 1)

I think this might work:

  result = pd.concat([df1, df2]).groupby(['Month','DayOfWeek','Class'], as_index=False,axis=0).mean().dropna()

The output looks something like:

     Month DayOfWeek Class   A1   A2  A999
 2   July    Monday  Adid  9.0  8.0   5.0
 3   July    Monday  Bata  7.0  9.0   5.0
 4   July    Sunday  Adid  4.0  0.0   4.0
 5   July    Sunday  Bata  4.0  5.0   6.0
 6   July   Tuesday  Bata  3.0  1.0   2.0
 8    Oct    Monday  Nike  4.0  2.0   5.0

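concat stacks the two dataframes and aligns them on column names, so frames with different sets of columns can be combined. I assume you want to group by Month, DayOfWeek and Class. In "as_index=False, axis=0", as_index=False keeps the group keys as ordinary columns and axis=0 (the default) groups the rows. Grouping by Month, DayOfWeek and Class produces one row for every combination present in the data, including combinations whose values are all NaN: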

       Month DayOfWeek Class   A1   A2  A999
  0    Jul    Monday  Adid    NaN  NaN   NaN  

In this particular case there is no data, so the row is of no interest when printed; the solution is to add dropna() at the end.
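To see what the concat and groupby steps produce on a small case, here is a minimal sketch. The two toy frames and their values are hypothetical, made up for illustration, and the names `keys`, `combined`, `group_means` and `result` are not from the answer. The `axis=0` argument is left out because it is the default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the values are illustrative only.
df1 = pd.DataFrame({
    "Month": ["Jul", "Jul", "Sep"],
    "DayOfWeek": ["Monday", "Monday", "Sunday"],
    "Class": ["Adid", "Bata", "Nike"],
    "A1": [9.0, 7.0, 8.0],
    "A2": [8.0, 9.0, 3.0],
})
df2 = pd.DataFrame({
    "Month": ["Jul", "Jul", "Oct"],
    "DayOfWeek": ["Monday", "Monday", "Satday"],
    "Class": ["Bata", "Adid", "Nike"],
    "A1": [5.0, np.nan, np.nan],
    "A2": [7.0, np.nan, np.nan],
})

keys = ["Month", "DayOfWeek", "Class"]

# Stack both frames; concat aligns them on column names.
combined = pd.concat([df1, df2], ignore_index=True)

# One row per key combination; mean() skips NaN inside a group,
# so a group is NaN only when every value in it is NaN.
group_means = combined.groupby(keys, as_index=False).mean()

# Drop the groups that have no data at all (here: Oct / Satday / Nike).
result = group_means.dropna()
print(result)
```

Two things are worth noting about this sketch. The result is a table of per-group means rather than a filled-in copy of df2, and because the frames are concatenated first, values already present in df2 (the Bata row here) are averaged together with the df1 values.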

Hope this helps.

Answer 1: (score: 1)

You can use groupby, merge and update as shown below.

Generate dummy data

import numpy as np
import pandas as pd

Mth = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
Wk = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def generate(nan=False):
    # 20 rows x 20 value columns of random floats
    values = np.random.rand(20,20)
    if nan:
        # blank out roughly one third of the cells
        nan_mask = np.random.choice([False,False,True], (20,20))
        values[nan_mask] = np.nan

    df = pd.DataFrame(values, columns = [f"A{i}" for i in range(values.shape[1])])
    # random Month / DayOfWeek key columns
    df_ = pd.DataFrame()
    df_["Month"] = np.random.choice(Mth,20)
    df_["DayOfWeek"] = np.random.choice(Wk,20)

    # key columns first, then the value columns
    df = pd.concat([df_, df], sort=False, axis=1)
    return df

df1 = generate()      # no NaN
df2 = generate(True)  # with NaN

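Solution: first compute the mean for each (Month, DayOfWeek) combination in df1, then merge those means onto df2's key columns so that there is one row of means per row of df2, in the same order and with the same default index, and finally use them to update the NaN values in df2. Combinations that do not occur in df1 stay NaN.

means = df1.groupby(["Month", "DayOfWeek"]).mean().reset_index()
# look the means up by df2's keys so that row i of means belongs to row i of df2
means = df2[["Month", "DayOfWeek"]].merge(means, how="left", on=["Month", "DayOfWeek"])

print(df2)
df3 = df2.copy()
# overwrite=False only fills cells that are NaN in df3
df3.update(means, overwrite=False)
print(df3)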
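The same idea can be checked on a few rows taken from the question's tables (columns A1 and A2 only). This is a minimal sketch, not the answerer's code: the helper name `fill_from_group_means` is hypothetical, it uses `fillna` with an aligned frame instead of `update`, and it assumes the month labels should be normalised to their first three letters, because the sample tables spell them "July"/"Sept" in df1 and "Jul"/"Sep" in df2. Only numeric columns are averaged, so the Class column is left alone and no column names need to be listed.

```python
import numpy as np
import pandas as pd

def fill_from_group_means(target, source, keys):
    """Fill NaNs in `target` with per-group column means computed from `source`.

    Hypothetical helper; assumes `source` has every numeric column of `target`.
    """
    # Every numeric column that is not a key, however many there are.
    value_cols = [c for c in target.select_dtypes(include="number").columns
                  if c not in keys]
    means = source.groupby(keys)[value_cols].mean().reset_index()
    # Left merge: one row of means per row of `target`, in the same order.
    aligned = target[keys].merge(means, how="left", on=keys)
    aligned.index = target.index  # so fillna lines the rows up by index
    return target.fillna(aligned)

df1 = pd.DataFrame({
    "Month": ["July", "July", "Sept", "Sept", "Sept"],
    "DayOfWeek": ["Monday", "Monday", "Sunday", "Monday", "Monday"],
    "Class": ["Bata", "Adid", "Nike", "Nike", "Bata"],
    "A1": [7, 9, 8, 7, 8],
    "A2": [9, 8, 3, 5, 9],
})
df2 = pd.DataFrame({
    "Month": ["Jul", "Jul", "Sep", "Oct"],
    "DayOfWeek": ["Monday", "Monday", "Sunday", "Satday"],
    "Class": ["Bata", "Adid", "Nike", "Nike"],
    "A1": [5, np.nan, np.nan, np.nan],
    "A2": [7, np.nan, np.nan, np.nan],
})

# Assumption: "July" and "Jul" mean the same month, so keep three letters.
df1["Month"] = df1["Month"].str[:3]

filled = fill_from_group_means(df2, df1, keys=["Month", "DayOfWeek"])
print(filled)
```

In this sketch the Jul/Monday row is filled with the mean of the two Jul/Monday rows of df1, the Sep/Sunday row with the single matching row, and the Oct/Satday row stays NaN because df1 has no such combination in this sample.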