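Groupby methods on an xarray Dataset

Time: 2017-11-03 17:57:32

Tags: pandas-groupby quantile python-xarray

I have a typical xarray Dataset. It holds monthly data (one value per month over 38 years).

I am interested in computing the quantile values for each month separately.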

<xarray.Dataset>
Dimensions:        (lat: 26, lon: 71, time: 456)
Coordinates:
  * lat            (lat) float32 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 ...
  * lon            (lon) float32 -130.0 -129.0 -128.0 -127.0 -126.0 -125.0 ...
  * time           (time) datetime64[ns] 1979-01-31 1979-02-28 1979-03-31 ...
Data variables:
    var1         (time, lat, lon) float32 nan nan nan nan nan nan nan nan ...
    var2         (time, lat, lon) float32 nan nan nan nan nan nan nan nan ...
    var3         (time, lat, lon) float32 nan nan nan nan nan nan nan nan ...
    ......

For example, if I want the monthly means, I use:

ds.groupby('time.month').mean(dim='time')

But if I try

ds.groupby('time.month').quantile(0.75, dim='time')

I get

AttributeError: 'DatasetGroupBy' object has no attribute 'quantile'

However, according to the pandas documentation, quantile is available on groupby objects.

In fact, I tried the following:

df_ds = ds.to_dataframe()
df_ds = df_ds.reset_index()
df_ds = df_ds.set_index('time')
# pd.TimeGrouper was deprecated and later removed from pandas; pd.Grouper replaces it.
df_ds.groupby(pd.Grouper(freq='M')).quantile(0.75)

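It works; of course this is a simpler case, because I have only one index. In fact, if I skip the reset_index / set_index step that reduces the frame to a single index, pandas raises an error saying it cannot handle the MultiIndex.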
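For reference, pandas can group on MultiIndex levels by name, so flattening to a single index is not strictly required. The following is only a minimal sketch on a small synthetic dataset: the data, the sizes, and the names `month`, `q75` and `q75_ds` are made up for illustration and do not come from the question. Unlike the pd.TimeGrouper call above, which forms one group per calendar month of each year, it groups by month of year (1 to 12) and by grid cell.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the dataset in the question (values are random).
time = pd.date_range("1979-01-01", periods=24, freq="MS")
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"var1": (("time", "lat", "lon"), rng.random((24, 2, 3)))},
    coords={"time": time, "lat": [25.0, 26.0], "lon": [-130.0, -129.0, -128.0]},
)

# The resulting frame has a MultiIndex over time, lat and lon.
df = ds.to_dataframe()

# Month of year (1..12) for every row, read from the "time" index level.
month = df.index.get_level_values("time").month.rename("month")

# Group by month of year and by grid cell; "lat" and "lon" name index levels.
q75 = df.groupby([month, "lat", "lon"]).quantile(0.75)

# Convert back to xarray; the index levels become the dimensions.
q75_ds = q75.to_xarray()
```

The round trip through a DataFrame copies the data into a long table, so this is likely to be slower and more memory hungry than staying in xarray for large grids.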

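So, can xarray do this? Maybe with some apply / lambda combination?

I found a way that is not very elegant. It is workable because I have only 4 variables (I could loop over the variable names, but I do not do that here):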

# iok_conus_xarray is defined elsewhere (presumably a boolean mask).
Data_clim_monthly_75g = ds.where(iok_conus_xarray).groupby('time.month')
# The monthly mean is used only as a template with the right (month, lat, lon) layout.
Data_clim_monthly_75 = ds.where(iok_conus_xarray).groupby('time.month').mean(dim='time')

v1 = Data_clim_monthly_75['var1'].values
v2 = Data_clim_monthly_75['var2'].values
v3 = Data_clim_monthly_75['var3'].values
v4 = Data_clim_monthly_75['var4'].values
# k is the month number (1..12); overwrite each monthly slice with the 75th percentile.
for k, gp in Data_clim_monthly_75g:
    v1[k-1] = np.nanpercentile(gp['var1'].values, q=75, axis=0)
    v2[k-1] = np.nanpercentile(gp['var2'].values, q=75, axis=0)
    v3[k-1] = np.nanpercentile(gp['var3'].values, q=75, axis=0)
    v4[k-1] = np.nanpercentile(gp['var4'].values, q=75, axis=0)
Data_clim_monthly_75['var1'] = (('month', 'lat', 'lon'), v1)
Data_clim_monthly_75['var2'] = (('month', 'lat', 'lon'), v2)
Data_clim_monthly_75['var3'] = (('month', 'lat', 'lon'), v3)
Data_clim_monthly_75['var4'] = (('month', 'lat', 'lon'), v4)

I am basically working around xarray. I would still prefer a solution within xarray.
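The same workaround can be written once for any number of variables by looping over the variable names. This is a rough sketch under stated assumptions: `monthly_percentile` is a hypothetical helper name, the dataset is synthetic, every variable is assumed to have dimensions (time, lat, lon), and all 12 months are assumed to be present so that month k maps to slice k-1.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real dataset (two variables, three years).
time = pd.date_range("1979-01-01", periods=36, freq="MS")
rng = np.random.default_rng(1)
shape = (time.size, 2, 3)
ds = xr.Dataset(
    {name: (("time", "lat", "lon"), rng.random(shape)) for name in ("var1", "var2")},
    coords={"time": time, "lat": [25.0, 26.0], "lon": [-130.0, -129.0, -128.0]},
)


def monthly_percentile(ds, q=75):
    """Per-month percentile of every data variable (hypothetical helper)."""
    # The monthly mean only provides a template with a "month" coordinate.
    out = ds.groupby("time.month").mean(dim="time")
    for name in ds.data_vars:
        values = np.empty(out[name].shape)
        for month, group in ds[name].groupby("time.month"):
            # Reduce along the time axis of this month's group, ignoring NaN.
            axis = group.get_axis_num("time")
            values[month - 1] = np.nanpercentile(group.values, q, axis=axis)
        out[name] = (("month", "lat", "lon"), values)
    return out


result = monthly_percentile(ds)
```

Looping over `ds.data_vars` avoids hard-coding `var1` to `var4`, but the code still steps outside xarray for the reduction itself.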

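1 Answer:

Answer 0: (score: 4)

We have not yet added a quantile method to groupby objects. However, you can use the reduce method to apply an arbitrary reduction function to each group. In the example below, I apply np.nanpercentile to each group. (np.nanpercentile takes q as a percentage from 0 to 100, so use q=75 rather than q=0.75 to get the 75th percentile.)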

In [21]: ds
Out[21]:
<xarray.Dataset>
Dimensions:  (lat: 71, lon: 26, time: 456)
Coordinates:
  * time     (time) datetime64[ns] 1979-01-31 1979-02-28 1979-03-31 ...
Dimensions without coordinates: lat, lon
Data variables:
    var1     (time, lon, lat) float64 0.4286 0.4032 0.2178 0.7652 0.8108 ...
    var2     (time, lon, lat) float64 0.8259 0.3625 0.6556 0.7403 0.2381 ...

In [22]: ds.groupby('time.month').reduce(np.nanpercentile, dim='time', q=0.75)
Out[22]:
<xarray.Dataset>
Dimensions:  (lat: 71, lon: 26, month: 12)
Coordinates:
  * month    (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
Dimensions without coordinates: lat, lon
Data variables:
    var1     (month, lon, lat) float64 0.04153 0.03099 0.07881 0.01749 ...
    var2     (month, lon, lat) float64 0.03518 0.06896 0.01287 0.025 0.01536 ...
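To check the reduce approach end to end, here is a small self-contained sketch. The dataset is synthetic and the names `p75`, `jan` and `expected_jan` are illustrative only. It passes q=75, since np.nanpercentile works in percent, and cross-checks January against a direct NumPy computation.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly data over four years, with one missing value.
time = pd.date_range("1979-01-01", periods=48, freq="MS")
rng = np.random.default_rng(42)
data = rng.random((time.size, 3, 4))
data[0, 0, 0] = np.nan
ds = xr.Dataset({"var1": (("time", "lat", "lon"), data)}, coords={"time": time})

# reduce forwards extra keyword arguments (here q) to the reduction function.
p75 = ds.groupby("time.month").reduce(np.nanpercentile, dim="time", q=75)

# Cross-check: 75th percentile over all Januaries, computed with NumPy alone.
jan = data[time.month == 1]
expected_jan = np.nanpercentile(jan, 75, axis=0)
```

Here `p75["var1"].sel(month=1)` should match `expected_jan`. More recent xarray releases also appear to provide a quantile method on groupby objects, so `ds.groupby('time.month').quantile(0.75, dim='time')` may work directly, depending on the installed version; that call is not exercised in this sketch.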