Question

Let's say I have the following dataframe:

df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103], "date": [0, 5, 0, 7, 11, 0], "val1": [99, 11, 22, 33, 44, 22], "val2": [77, 88, 22, 66, 55, 33]})

What I want to achieve is a 3-dimensional numpy array, so that the result looks like this:

np_pros = np.array([[[0, 99, 77], [5, 11, 88]], [[0, 22, 22], [7, 33, 66], [11, 44, 55]], [[0, 22, 33]]])

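In other words, the 3D array should have the shape [unique_ids, None, feature_size]. In my case, the number of unique_ids is 3 and the feature size is 3 (all columns except person_id), while the y dimension has variable length and represents the number of measurements for each person_id.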

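I am well aware that I could create an np.zeros((unique_ids, max_num_features, feature_size)) array, fill it, and then delete the elements I don't need, but I want something faster. The reason is that my actual dataframe is large (roughly [50000, 455]), which would result in a numpy array of about [12500, 200, 455].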

Looking forward to your answers!

Answer 1

Here is one approach (df1 is the dataframe from the question):

ix = np.flatnonzero(df1.person_id != df1.person_id.shift(1))
np.split(df1.drop('person_id', axis=1).values, ix[1:])

[array([[ 0, 99, 77],
        [ 5, 11, 88]], dtype=int64), 
 array([[ 0, 22, 22],
        [ 7, 33, 66],
        [11, 44, 55]], dtype=int64), 
 array([[ 0, 22, 33]], dtype=int64)]

Details

Use np.flatnonzero after comparing person_id with a shifted version of itself (via shift) to obtain the indices at which person_id changes:

ix = np.flatnonzero(df1.person_id != df1.person_id.shift(1))
#array([0, 2, 5])

Use np.split to split the columns of interest of the dataframe at the indices obtained:

np.split(df1.drop('person_id', axis=1).values, ix[1:])

[array([[ 0, 99, 77],
        [ 5, 11, 88]], dtype=int64), 
 array([[ 0, 22, 22],
        [ 7, 33, 66],
        [11, 44, 55]], dtype=int64), 
 array([[ 0, 22, 33]], dtype=int64)]
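The shift comparison only finds group boundaries when all rows of one person_id are adjacent. The following is a minimal, self-contained sketch of the same idea under that assumption; the stable sort and the names `starts`, `features`, and `groups` are illustrative additions, not part of the answer above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "person_id": [101, 101, 102, 102, 102, 103],
    "date": [0, 5, 0, 7, 11, 0],
    "val1": [99, 11, 22, 33, 44, 22],
    "val2": [77, 88, 22, 66, 55, 33],
})

# Assumption: rows of one person must be contiguous. A stable sort
# (mergesort) makes them contiguous while keeping each person's rows
# in their original order; it is a no-op for the sample data.
df1 = df1.sort_values("person_id", kind="mergesort").reset_index(drop=True)

ids = df1["person_id"].to_numpy()
# Row positions where the id differs from the previous row,
# i.e. the start of every group except the first.
starts = np.flatnonzero(ids[1:] != ids[:-1]) + 1

features = df1.drop(columns="person_id").to_numpy()
groups = np.split(features, starts)  # list of 2D arrays, one per person
```

Comparing the raw id array with itself avoids the NaN that shift places in the first row, so the first index no longer has to be sliced off.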

Answer 2

You can use groupby:

import pandas as pd

df_raw = pd.DataFrame({"person_id": [101, 101, 102, 102, 102, 103], "date": [0, 5, 0, 7, 11, 0], "val1": [99, 11, 22, 33, 44, 22], "val2": [77, 88, 22, 66, 55, 33]})

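# Select the feature columns inside the loop so that person_id is dropped
# regardless of how the installed pandas version iterates over a sliced groupby.
result = [group[['date', 'val1', 'val2']].values for _, group in df_raw.groupby('person_id')]
print(result)

Output

[array([[ 0, 99, 77],
       [ 5, 11, 88]]), array([[ 0, 22, 22],
       [ 7, 33, 66],
       [11, 44, 55]]), array([[ 0, 22, 33]])]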
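If a single numpy object is wanted rather than a Python list, the per-person arrays can be stored in a 1D array of dtype object. This is only a sketch; the names `feature_cols`, `groups`, and `ragged` are made up for illustration. Recent NumPy versions refuse to build a ragged array from nested sequences unless dtype=object is requested, so the array is allocated first and filled element by element.

```python
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({
    "person_id": [101, 101, 102, 102, 102, 103],
    "date": [0, 5, 0, 7, 11, 0],
    "val1": [99, 11, 22, 33, 44, 22],
    "val2": [77, 88, 22, 66, 55, 33],
})

feature_cols = ["date", "val1", "val2"]  # every column except person_id
groups = [g[feature_cols].to_numpy() for _, g in df_raw.groupby("person_id")]

# Pre-allocate an object array and assign each 2D block individually,
# so NumPy never attempts to stack the blocks into a regular 3D array.
ragged = np.empty(len(groups), dtype=object)
for i, block in enumerate(groups):
    ragged[i] = block
```

Such an object array is a container of independent 2D arrays, so vectorized operations across persons are not available on it.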

Answer 3

Another solution, using xarray

Let's create a new dimension that indexes the repeated occurrences of each person_id:

>>> df['newdim'] = df.person_id.duplicated()
>>> df.newdim = df.groupby('person_id').newdim.cumsum()
>>> df = df.set_index(["newdim", "person_id"])
>>> df
                  date  val1  val2
newdim person_id
0.0    101           0    99    77
1.0    101           5    11    88
0.0    102           0    22    22
1.0    102           7    33    66
2.0    102          11    44    55
0.0    103           0    22    33

For readability, we may want to turn df into an xarray.Dataset object:

>>> xa = df.to_xarray()
>>> xa
<xarray.Dataset>
Dimensions:    (newdim: 3, person_id: 3)
Coordinates:
  * newdim     (newdim) float64 0.0 1.0 2.0
  * person_id  (person_id) int64 101 102 103
Data variables:
    date       (newdim, person_id) float64 0.0 0.0 0.0 5.0 7.0 nan nan 11.0 nan
    val1       (newdim, person_id) float64 99.0 22.0 22.0 11.0 33.0 nan nan ...
    val2       (newdim, person_id) float64 77.0 22.0 33.0 88.0 66.0 nan nan ...

It can then be converted into a regularly shaped numpy array:

>>> ar = xa.to_array().T.values
>>> ar
array([[[ 0., 99., 77.],
        [ 5., 11., 88.],
        [nan, nan, nan]],

       [[ 0., 22., 22.],
        [ 7., 33., 66.],
        [11., 44., 55.]],

       [[ 0., 22., 33.],
        [nan, nan, nan],
        [nan, nan, nan]]])

Note that nan values are necessarily introduced to pad the shorter sequences, which also coerces the data to float.
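A similar NaN-padded array can also be built with pandas and numpy alone. This is a sketch, not the method of the answer above: the names `codes`, `pos`, `padded`, and `mask` are illustrative, and the boolean mask is an added convenience for telling real rows from padding.

```python
import numpy as np
import pandas as pd

df_raw = pd.DataFrame({
    "person_id": [101, 101, 102, 102, 102, 103],
    "date": [0, 5, 0, 7, 11, 0],
    "val1": [99, 11, 22, 33, 44, 22],
    "val2": [77, 88, 22, 66, 55, 33],
})
feature_cols = ["date", "val1", "val2"]

# 0-based group number of every row, numbered in order of first appearance.
codes, uniques = pd.factorize(df_raw["person_id"])
# Position of every row within its own person (0, 1, 2, ...).
pos = df_raw.groupby("person_id").cumcount().to_numpy()

# Allocate the padded block once and scatter all rows in a single
# fancy-indexed assignment; untouched slots keep their NaN padding.
padded = np.full((len(uniques), pos.max() + 1, len(feature_cols)), np.nan)
padded[codes, pos] = df_raw[feature_cols].to_numpy()

# True where a slot holds a real measurement, False where it is padding.
mask = np.zeros(padded.shape[:2], dtype=bool)
mask[codes, pos] = True
```

Because the rows are written in one assignment, there is no Python-level loop over persons, and the mask stays correct even if the data itself contains NaN.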

numpy: create variable-length sequences from a pandas dataframe
