Question

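I would like to compute multi-year averages (and quantiles) on an xarray DataArray.

If the time sampling is a multiple of one day, I can easily do something like this: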

arr.groupby('time.dayofyear').mean('time')

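But if the data also has sub-daily (hourly) time steps, I can't find an easy way to do the same thing. (For now I am using a horrible trick.)

For example, in this case: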

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(
     np.ones(len(time)), 
     dims='time', 
     coords={'time' : ('time', time)}
)

Maybe I'm missing something; I'm not very experienced with pandas and xarray. Do you have any tips?

Thanks a lot.

Answer 1

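For daily averages, I would suggest using the resample method. If I understand the question correctly, this should give you daily means. You can then use those daily means in your groupby operations by day of year.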

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(
     np.ones(len(time)), 
     dims='time', 
     coords={'time' : ('time', time)}
)

daily = arr.resample(time='D').mean('time')

Answer 2

If you want daily averages, resample is the best tool for the job:

daily = arr.resample(time='D').mean('time')

You can then use groupby to compute quantiles for each day of the year:

quantiles_by_dayofyear = daily.groupby('time.dayofyear').apply(
    xr.DataArray.quantile, q=[0.25, 0.5, 0.75])

print(quantiles_by_dayofyear)

This yields:

<xarray.DataArray (dayofyear: 366, quantile: 3)>
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       ...,
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
Coordinates:
  * quantile   (quantile) float64 0.25 0.5 0.75
  * dayofyear  (dayofyear) int64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...

We should add a quantile method to xarray's list of groupby reduce methods, but this should work for now.
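As far as I know, more recent xarray releases do expose `quantile` directly on groupby objects. The following is a minimal sketch under that assumption (it is not the code from the answer above, and it may not work on older versions, where the `apply` form is still needed):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same 6-hourly test data as in the question.
time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(np.ones(len(time)), dims='time', coords={'time': time})

# Step 1: reduce the 6-hourly samples to daily means.
daily = arr.resample(time='D').mean('time')

# Step 2: quantiles per day of year, assuming a version of xarray
# whose groupby objects provide a quantile method.
quantiles_by_dayofyear = daily.groupby('time.dayofyear').quantile(
    [0.25, 0.5, 0.75], dim='time')

print(quantiles_by_dayofyear.sizes)
```

The result should carry a `dayofyear` dimension (366 entries here, because the range includes leap years) and a `quantile` dimension of length 3.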

Answer 3

Sorry, my question may not have been clear. Consider only the quantiles. My expected output is something like this:

<xarray.DataArray (hours: 1464, quantile: 3)>
array([[1., 1., 1.],
      [1., 1., 1.],
      [1., 1., 1.],
      ...,
      [1., 1., 1.],
      [1., 1., 1.],
      [1., 1., 1.]])
Coordinates:
* quantile   (quantile) float64 0.25 0.5 0.75
* hours  (hours) int64 6 12 18 24 30 36 42 48 54 60 66 72 ...

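Here, hours is the number of hours since the start of the year. Instead of hours, it would also be fine to have something like a MultiIndex with day of year and hour (of the day). I have a tricky way of doing it (some reindexing with a MultiIndex and unstacking the time dimension), but it is really ugly. I think there must be an easier, more elegant way to do this.

Thank you very much.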

Answer 4

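My understanding of the question is that you either want to group by two variables at once, or you want to work through the methods of the xarray DateTimeAccessor rather than through groupby.

You may be looking for xarray.apply_ufunc. Here is some code I use to compute grouped means by year and month.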
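One way to approximate a two-variable groupby is to build a single derived key from the DateTimeAccessor fields and group on that. The sketch below is only an illustration under that assumption: the coordinate name `hour_of_year` is made up for this example, and the direct `quantile` call assumes an xarray version whose groupby objects provide it.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same 6-hourly test data as in the question.
time = pd.date_range('2000-01-01', '2010-01-01', freq='6h')
arr = xr.DataArray(np.ones(len(time)), dims='time', coords={'time': time})

# Hypothetical combined key: hours elapsed since the start of each year,
# built from the day of year and the hour of the day.
hour_of_year = (arr.time.dt.dayofyear - 1) * 24 + arr.time.dt.hour
arr = arr.assign_coords(hour_of_year=hour_of_year)

# Group on the derived (non-index) coordinate instead of 'time.dayofyear'.
mean_by_hour = arr.groupby('hour_of_year').mean('time')
quantiles_by_hour = arr.groupby('hour_of_year').quantile(
    [0.25, 0.5, 0.75], dim='time')

print(quantiles_by_hour.sizes)
```

With 6-hourly data and leap years in the range, this gives 366 * 4 = 1464 groups, which matches the shape of the expected output shown in the question. Note that this key starts at 0 rather than 6.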

import numpy as np
import xarray as xr

def _grouped_mean(
            data: np.ndarray,
            months: np.ndarray,
            years: np.ndarray) -> np.ndarray:
        """similar to grouping year_month MultiIndex, but faster.

        Should be used wrapped by _wrapped_grouped_mean"""
        unique_months = np.sort(np.unique(months))
        unique_years = np.sort(np.unique(years))
        old_shape = list(data.shape)
        new_shape = old_shape[:-1]
        new_shape.append(unique_months.shape[0])
        new_shape.append(unique_years.shape[0])

        output = np.zeros(new_shape)

        for i_month, j_year in np.ndindex(output.shape[2:]):
            indices = np.intersect1d(
                (months == unique_months[i_month]).nonzero(),
                (years == unique_years[j_year]).nonzero()
            )

            output[:, :, i_month, j_year] =\
                np.mean(data[:, :, indices], axis=-1)

        return output

def _wrapped_grouped_mean(da: xr.DataArray) -> xr.DataArray:
        """similar to grouping by a year_month MultiIndex, but faster.

        Wraps a numpy-style function with xr.apply_ufunc
        """
        Y = xr.apply_ufunc(
            _grouped_mean,
            da,
            da.time.dt.month,
            da.time.dt.year,
            input_core_dims=[['lat', 'lon', 'time'], ['time'], ['time']],
            output_core_dims=[['lat', 'lon', 'month', 'year']],
        )
        Y = Y.assign_coords(
            {'month': np.sort(np.unique(da.time.dt.month)),
             'year': np.sort(np.unique(da.time.dt.year))})
        return Y
