Question

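Apologies in advance that my "minimal example" has a three-dimensional array, but it is needed to show the full force of the problem I am running into (because groupby ends up stacking two dimensions together when I only intend to sum over one of them):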

Context

import xarray as xr
import numpy as np
ds = xr.Dataset()

ds['kind'] = (['layer', 'qpoint'], [
    ['gamma', 'other', 'selected', 'selected', 'other', 'other'],
    ['selected', 'selected', 'other', 'gamma', 'other', 'other'],
])

# for each layer and eigenmode, we have a probability distribution
#  over the qpoints.
probs = np.random.random((2, 18, 6))
probs /= probs.sum(axis=2, keepdims=True) # sum over qpoints is 1
ds['prob'] = (['layer', 'mode', 'qpoint'], probs)

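Goal:

Perform some kind of sum over the 'qpoint' dimension of ds['prob'], within groups of equal ds['kind'], producing a DataArray with dims ['layer', 'mode', 'kind'] (not necessarily in that order).

My best attempt:

I tried to solve this with groupby('kind').sum(), but I really have no idea what I'm doing. I can't make heads or tails of how to use GroupBy.sum when I only want to sum over the qpoint dimension. (Calling it naively also sums over all layers and modes.)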

In the end, I tried using the GroupBy as an iterator, but I still ran into trouble putting all the data back together.

pairs = ds.groupby('kind')

# I couldn't make heads or tails of how to use GroupBy.sum,
#  so I tried the more familiar concept of iteration.

# focus on the 'prob' DataArrays
pairs = [(kind, d['prob']) for (kind, d) in pairs]

# unstack them so that `qpoint` is a valid dimension to sum over.
# (this densifies the arrays in the process though, producing 2x18x6 arrays
#  that are mostly filled with nan; this seems kind of backwards...)
pairs = [(kind, d.unstack('stacked_layer_qpoint')) for (kind, d) in pairs]
pairs = [(kind, d.sum(dim='qpoint')) for (kind, d) in pairs]

# 'kind' got lost when we did the sum. add it back
arrays = [d.assign_coords(kind=kind) for (kind, d) in pairs]

At this point, each array looks like:

<xarray.DataArray 'prob' (mode: 18, layer: 2)>
array([[0.231093, 0.345689],
       ...
       [0.204913, 0.043868]])
Coordinates:
  * layer    (layer) int64 0 1
    kind     <U5 'gamma'
Dimensions without coordinates: mode

But even after all of this, when I try to put them back together, I get the following error, and I have no idea how to resolve it. (I do want 'kind' to be a coordinate, but I don't know what it wants me to do!)

ds['kind-prob'] = xr.concat(arrays, dim='kind')
# xarray.core.merge.MergeError: unable to determine if these variables
# should be coordinates or not in the merged result: {'kind'}

Answer 1

Mere seconds before posting this, I tried, purely on a whim, changing this code to use groupby(...).apply(...). To my surprise, my first attempt at using apply seemed to produce what I wanted:

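result = (
    ds.groupby('kind')
    .apply(lambda d: d['prob']
        .unstack('stacked_layer_qpoint')
        .sum(dim='qpoint')
    )
)
# result has dims: (kind: 3, mode: 18, layer: 2)

However, this still fails with MergeError if naively assigned to a new variable of ds.

~~It turns out this error occurs simply because I had kind set as a variable rather than a coordinate. This can be corrected after the fact by using set_coords to "upgrade" kind into a coordinate:~~

ds['kind-prob'] = result # this would give MergeError
ds = ds.set_coords('kind')
ds['kind-prob'] = result # ok now

Edit: No, this is a bad idea too. result has 'kind' as a dimension, but after this assignment, 'kind' is merely a coordinate. Perhaps the real problem here is that I am trying to have a dimension and a variable share the same name, which is not supported.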
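One way to sidestep the suspected name clash might be to rename the grouped dimension before assigning. The sketch below is only an illustration under assumptions: `kind_group` and `kind_prob` are made-up names, and the per-kind sums are computed with boolean masks (`where` plus `xr.concat`) as a stand-in for `result`, not with the groupby code above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the example data from the question.
ds = xr.Dataset()
ds['kind'] = (['layer', 'qpoint'], [
    ['gamma', 'other', 'selected', 'selected', 'other', 'other'],
    ['selected', 'selected', 'other', 'gamma', 'other', 'other'],
])
probs = np.random.random((2, 18, 6))
probs /= probs.sum(axis=2, keepdims=True)  # sum over qpoints is 1
ds['prob'] = (['layer', 'mode', 'qpoint'], probs)

# Stand-in for `result`: for each kind, mask out the other qpoints
# (they become NaN) and sum over qpoint, which skips NaN by default.
kinds = np.unique(ds['kind'].values)
result = xr.concat(
    [ds['prob'].where(ds['kind'] == k).sum(dim='qpoint') for k in kinds],
    dim=pd.Index(kinds, name='kind'),
)

# Rename the new dimension so it no longer shares a name with the
# 2D data variable ds['kind'] (`kind_group` is a hypothetical name).
ds['kind_prob'] = result.rename({'kind': 'kind_group'})
```

With distinct names, ds['kind'] keeps its (layer, qpoint) shape and the grouped sums live along their own dimension.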

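I also found two possible improvements:

- DataArray.groupby can take a DataArray containing the values to group by (so there is no need to call groupby on the whole Dataset).
- ~~I can avoid unstack by making nested calls to groupby. The idea is to groupby each thing you want as a dimension in the output array.~~
  - This works because the original layer and qpoint coordinates are still present; if you inspect one of the arrays produced by the first groupby, you will see that it has one dimension (stacked_layer_qpoint) but three coordinates (stacked_layer_qpoint, layer, qpoint).
  - Edit: It turns out that the nested groupby-apply is very slow compared to the solution that uses unstack, presumably because it puts a lot of pure Python code (lambdas) into a hot loop.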

This leads to the following:
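As a minimal sketch of the first improvement combined with the unstack approach (not necessarily the code from the original post), assuming a recent xarray release where GroupBy.map is the current name for GroupBy.apply; `sum_over_qpoint` and `kind_prob` are illustrative names:

```python
import numpy as np
import xarray as xr

# Rebuild the example data from the question.
ds = xr.Dataset()
ds['kind'] = (['layer', 'qpoint'], [
    ['gamma', 'other', 'selected', 'selected', 'other', 'other'],
    ['selected', 'selected', 'other', 'gamma', 'other', 'other'],
])
probs = np.random.random((2, 18, 6))
probs /= probs.sum(axis=2, keepdims=True)  # sum over qpoints is 1
ds['prob'] = (['layer', 'mode', 'qpoint'], probs)

def sum_over_qpoint(d):
    # Each group arrives with layer and qpoint stacked into one dimension;
    # unstack it so that qpoint can be summed on its own.
    return d.unstack('stacked_layer_qpoint').sum(dim='qpoint')

# Group the 'prob' DataArray directly by the 2D 'kind' DataArray,
# rather than grouping the whole Dataset.
kind_prob = ds['prob'].groupby(ds['kind']).map(sum_over_qpoint)
```

Because every qpoint belongs to exactly one kind, summing kind_prob over 'kind' should give 1 for each layer and mode, which makes a quick sanity check.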
