ValueError: Data overlaps. in Python

Time: 2019-10-01 23:22:12

Tags: python pandas dataframe

I have a dataframe df3 like this.

The number of AAA_??? columns is unknown; the suffix can be anything in the dataset.

           Date    ID  Calendar_Year Month   DayName...  AAA_1E AAA_BMITH  AAA_4.1  AAA_CH
0    2019-09-17  8661           2019   Sep       Sun...     NaN       NaN      NaN     NaN
1    2019-09-18  8662           2019   Sep       Sun...     1.0       3.0     34.0     1.0
2    2019-09-19  8663           2019   Sep       Sun...     NaN       NaN      NaN     NaN
3    2019-09-20  8664           2019   Sep       Mon...     NaN       NaN      NaN     NaN
4    2019-09-20  8664           2019   Sep       Mon...     2.0       4.0     32.0     3.0
5    2019-09-20  8664           2019   Sep       Sat...     NaN       NaN      NaN     NaN
6    2019-09-20  8664           2019   Sep       Sat...     NaN       NaN      NaN     NaN
7    2019-09-20  8664           2019   Sep       Sat...     0.0       4.0     30.0     0.0

Another dataframe, dfMeans, holds the means of a third dataframe:

     Month Dayname           ID  ...  AAA_BMITH    AAA_4.1  AAA_CH
0      Jan     Thu  7686.500000  ...   0.000000  28.045455     0.0
1      Jan     Fri  7636.272727  ...   0.000000  28.136364     0.0
2      Jan     Sat  7637.272727  ...   0.000000  27.045455     0.0
3      Jan     Sun  7670.090909  ...   0.000000  27.090909     0.0
4      Jan     Mon  7702.909091  ...   0.000000  27.727273     0.0
5      Jan     Tue  7734.260870  ...   0.000000  27.956522     0.0

The means are grouped by Month and Dayname.

I want to replace the NaN values in df3 with the values from dfMeans,

using this line:

df3.update(dfMeans, overwrite=False, errors="raise")

But I get this error:

    raise ValueError("Data overlaps.")

ValueError: Data overlaps.

How can I update the NaN values with the values from dfMeans and avoid this error?
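For context, errors="raise" makes DataFrame.update raise as soon as both frames hold a non-NaN value at the same index label and column. The frames and values below are made up for illustration; this is a minimal sketch of the behaviour, not the data from the question.

```python
import numpy as np
import pandas as pd

# Made-up frames sharing the default integer index and the column "AAA_CH".
left = pd.DataFrame({"AAA_CH": [np.nan, 1.0, np.nan]})
right = pd.DataFrame({"AAA_CH": [0.5, 0.7, 0.9]})

try:
    # Row 1 is non-NaN in both frames, so errors="raise" rejects the update.
    left.copy().update(right, overwrite=False, errors="raise")
    message = None
except ValueError as exc:
    message = str(exc)

# Without errors="raise", overwrite=False fills only the NaN cells.
# Rows are still matched by index label, not by Month/Dayname.
filled = left.copy()
filled.update(right, overwrite=False)
print(message)
print(filled)
```

Dropping errors="raise" avoids the exception, but because df3 and dfMeans both use a plain integer index, rows would be paired by position rather than by month and day name.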

Edit:

I have put all the dataframes into one dataframe, df:

     Month Dayname           ID  ...  AAA_BMITH    AAA_4.1  AAA_CH
0      Jan     Thu  7686.500000  ...   0.000000  28.045455     0.0
1      Jan     Fri  7636.272727  ...   0.000000  28.136364     0.0
2      Jan     Sat  7637.272727  ...   0.000000  27.045455     0.0
3      Jan     Sun  7670.090909  ...   0.000000  27.090909     0.0
4      Jan     Mon  7702.909091  ...   0.000000  27.727273     0.0
5      Jan     Tue  7734.260870  ...   0.000000  27.956522     0.0

How do I fill the NaN values with the mean for each Month and Dayname?

2 answers:

Answer 0 (score: 2)

Use fillna

Data:

       Date    ID  Calendar_Year Month Dayname  AAA_1E  AAA_BMITH  AAA_4.1  AAA_CH
 2019-09-17  8661           2019   Jan     Sun     NaN        NaN      NaN     NaN
 2019-09-18  8662           2019   Jan     Sun     1.0        3.0     34.0     1.0
 2019-09-19  8663           2019   Jan     Sun     NaN        NaN      NaN     NaN
 2019-09-20  8664           2019   Jan     Mon     NaN        NaN      NaN     NaN
 2019-09-20  8664           2019   Jan     Mon     2.0        4.0     32.0     3.0
 2019-09-20  8664           2019   Jan     Sat     NaN        NaN      NaN     NaN
 2019-09-20  8664           2019   Jan     Sat     NaN        NaN      NaN     NaN
 2019-09-20  8664           2019   Jan     Sat     0.0        4.0     30.0     0.0

df.set_index(['Month', 'Dayname'], inplace=True)


df_mean:

Month Dayname           ID  AAA_BMITH    AAA_4.1  AAA_CH
  Jan     Thu  7686.500000        0.0  28.045455     0.0
  Jan     Fri  7636.272727        0.0  28.136364     0.0
  Jan     Sat  7637.272727        0.0  27.045455     0.0
  Jan     Sun  7670.090909        0.0  27.090909     0.0
  Jan     Mon  7702.909091        0.0  27.727273     0.0
  Jan     Tue  7734.260870        0.0  27.956522     0.0

df_mean.set_index(['Month', 'Dayname'], inplace=True)


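Update df

  • This operation matches rows on their index values
  • It cannot take several column names at once; you have to pick the columns of interest and iterate over them
  • Note that AAA_1E is not in df_mean
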
for col in df.columns:
    if col in df_mean.columns:
        # assign back; fillna(inplace=True) on df[col] may act on a temporary copy
        df[col] = df[col].fillna(df_mean[col])
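
The steps above can be tried end to end on a small example. The frames below are made-up stand-ins for df and df_mean (the values are illustrative, not taken from the question), and the name shared is introduced only for this sketch.

```python
import numpy as np
import pandas as pd

# Small made-up stand-ins for df and df_mean (values are illustrative only).
df = pd.DataFrame({
    "Month":   ["Jan", "Jan", "Jan", "Jan"],
    "Dayname": ["Sun", "Sun", "Mon", "Sat"],
    "AAA_1E":  [np.nan, 1.0, np.nan, np.nan],
    "AAA_4.1": [np.nan, 34.0, np.nan, 30.0],
}).set_index(["Month", "Dayname"])

df_mean = pd.DataFrame({
    "Month":   ["Jan", "Jan", "Jan"],
    "Dayname": ["Sat", "Sun", "Mon"],
    "AAA_4.1": [27.0, 27.5, 28.0],
}).set_index(["Month", "Dayname"])

# Fill only the columns both frames share; AAA_1E has no mean, so it is skipped.
shared = df.columns.intersection(df_mean.columns)
for col in shared:
    # fillna with a Series looks up each row's (Month, Dayname) label in df_mean.
    df[col] = df[col].fillna(df_mean[col])

result = df.reset_index()
print(result)
```

Rows that already hold a value keep it; only the NaN cells take the mean for their (Month, Dayname) pair.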


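Answer 1 (score: 1)

You can groupby on 'Month' and 'DayName', then use apply to edit the dataframe.
Use fillna to fill the NaN values. fillna accepts a dictionary as its value argument: the keys are column names and the values are scalars, each used to replace the NaN values in its column. With loc you can select the appropriate values from dfMeans. You can build the dictionary with a dict comprehension over the intersection of the columns of df3 and dfMeans.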
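
As a small illustration of the dictionary form of fillna (the frame, column names and fill values here are made up):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0], "c": [np.nan, 6.0]})

# Keys are column names, values are the scalars used for that column's NaN.
# Column "c" is not in the dict, so its NaN is left alone.
filled = frame.fillna({"a": 0.0, "b": -1.0})
print(filled)
```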

All of this corresponds to the following statement:

df3filled = df3.groupby(['Month', 'DayName']).apply(lambda x : x.fillna(
    {col : dfMeans.loc[(dfMeans['Month'] == x.name[0]) & (dfMeans['Dayname'] == x.name[1]), col].iloc[0]
    for col in x.columns.intersection(dfMeans.columns)})).reset_index(drop=True)
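
A different technique that may be easier to follow is sketched below. It is not the statement above: it left-merges the means onto df3 and fills from the merged columns. The data is made up, and the names means, keys and merged are introduced only for this sketch. It assumes df3 has the default integer index, so the merged rows line up with df3 by position.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for df3 and dfMeans; note "DayName" vs "Dayname".
df3 = pd.DataFrame({
    "Month":   ["Sep", "Sep", "Sep"],
    "DayName": ["Sun", "Sun", "Mon"],
    "AAA_CH":  [np.nan, 1.0, np.nan],
    "AAA_1E":  [np.nan, 1.0, np.nan],
})
dfMeans = pd.DataFrame({
    "Month":   ["Sep", "Sep"],
    "Dayname": ["Sun", "Mon"],
    "AAA_CH":  [0.25, 0.75],
})

# Align the key names, then attach each row's group mean with a left merge.
means = dfMeans.rename(columns={"Dayname": "DayName"})
keys = ["Month", "DayName"]
merged = df3.merge(means, on=keys, how="left", suffixes=("", "_mean"))

# Fill NaN in every shared value column from its "_mean" counterpart.
# Assumes df3 has the default RangeIndex, matching the merge result's index.
df3filled = df3.copy()
for col in means.columns.difference(keys):
    if col in df3filled.columns:
        df3filled[col] = df3filled[col].fillna(merged[col + "_mean"])
print(df3filled)
```

With this variant, a (Month, DayName) pair that has no row in dfMeans simply stays NaN, whereas the .iloc[0] lookup in the statement above raises an IndexError when no matching row exists.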