Question

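I have a dataset made up of IDs, each of which exists over some subset of a series of timestamps. There are 1813 timestamps [0, ..., 1812]; some IDs exist at all timestamps, some over the range (0, n), some over (n, m), and some over (m, 1812). Each ID has 108 features at each timestamp.

I currently create an ndarray with the following line: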

# Shape: (1424, ?, 108) = (numIDs, numIDTimestamps, numFeatures)
# to_numpy() replaces the removed as_matrix(); dtype=object is needed because the rows are ragged
inputMatrix = np.array([df.loc[df['id'] == ID, features].to_numpy() for ID in IDs], dtype=object)

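Here, the length of each element along dimension 1 equals the number of timestamps at which that ID exists. Instead, I need every element along this dimension to have length 1813, with each timestamp at which the given ID does not exist padded with an array of 108 zeros.

In pseudocode: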

for each ID:
    for each timestamp:
        if ID exists at timestamp:
            append its array of 108 features
        else:
            append array of 108 0s

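What is the most efficient, Pythonic way to achieve this, in a manner similar to what I did above?

Edit

Here is a sample of the structure of my dataset, which I import into a Pandas DataFrame: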

id      timestamp   derived_0   ...     technical_108     y
10      0           0.370326    ...     NaN             -0.011753
11      0           0.014765    ...     NaN             -0.001240
12      0           -0.010622   ...     NaN             -0.020940
25      0           NaN         ...     NaN             -0.015959
26      0           0.176693    ...     NaN             -0.007338

...     ...         ...         ...     ...             ...

2150    1812        -0.123364   ...     0.001004        0.004604
2151    1812        -10.437184  ...     0.044597        -0.009241
2154    1812        -0.077930   ...     0.030816        -0.006852
2156    1812        -0.269845   ...     -0.011706       -0.000785
2158    1812        NaN         ...     NaN             0.003497

This is the processing I do before the inputMatrix line above:

df = df.fillna(df.mean())

# SORT BY LAST TIMESTAMP
df = df.assign(start=df.groupby('id')['timestamp'].transform('min'),
               end=df.groupby('id')['timestamp'].transform('max'))\
               .sort_values(by=['end', 'start', 'timestamp'])

cols = list(df)
featureNames = ['derived', 'fundamental', 'technical']
features = [col for col in cols if col.split('_')[0] in featureNames]
numFeatures = len(features)
IDs = list((df['id'].unique()))                 # Sorted by ascending last timestamp
timestamps = list(df['timestamp'].unique())     # Sorted

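"Sort by last timestamp" means reordering the rows of the DataFrame so that the ID with the lowest end timestamp comes first, while each ID's rows remain sorted by timestamp.

e.g.: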

id      timestamp    ...
1314    0            ...
1314    1
1314    2
1699    0
1699    1
1699    2
1699    3

...
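As a small illustration of the ordering described above, the sketch below applies the same assign/sort_values step to a toy frame. The ids, timestamps, and the names `toy` and `toy_ids` are made up for this example and are not part of the real dataset.

```python
import pandas as pd

# Toy data (hypothetical): id 1699 spans timestamps 0-3, id 1314 spans 0-2.
toy = pd.DataFrame({
    'id':        [1699, 1699, 1699, 1699, 1314, 1314, 1314],
    'timestamp': [0, 1, 2, 3, 0, 1, 2],
})

# Same sort as in the question: by last timestamp, then first timestamp, then timestamp.
toy = (toy.assign(start=toy.groupby('id')['timestamp'].transform('min'),
                  end=toy.groupby('id')['timestamp'].transform('max'))
          .sort_values(by=['end', 'start', 'timestamp']))

# unique() preserves order of appearance, so ids come out by ascending last timestamp.
toy_ids = list(toy['id'].unique())
print(toy[['id', 'timestamp']].to_string(index=False))
```

Because 1314 ends at timestamp 2 and 1699 ends at timestamp 3, all rows for 1314 come first, which matches the listing above.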

Answer 1

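You could append, for each id, a series of timestamps from 0 to 1812, and then drop the rows where the timestamp and id are duplicated and the y column is missing.

A sketch of this code follows: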

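# Append a row for every (id, timestamp) pair with timestamps 0..1812
full = pd.MultiIndex.from_product([IDs, range(1813)], names=['id', 'timestamp']).to_frame(index=False)
df = pd.concat([df, full], ignore_index=True)

# Drop appended rows (y missing) whose (id, timestamp) already existed, then zero-fill and re-sort
dup = df.duplicated(subset=['id', 'timestamp'], keep=False) & df['y'].isnull()
df = df[~dup].fillna(0).sort_values(['id', 'timestamp'])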

After this, you can apply your existing code.
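An alternative that avoids growing the DataFrame is to preallocate a zero array and write the existing rows into it with integer indexing. This is only a minimal sketch on toy data, not the method proposed in this answer. It assumes timestamps are integers starting at 0 and that each (id, timestamp) pair occurs at most once. The toy frame, `num_timestamps`, `id_pos`, `rows`, and `cols` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical) for the question's df, features, and IDs.
num_timestamps = 4                        # 1813 in the question
features = ['derived_0', 'technical_0']   # 108 columns in the question
df = pd.DataFrame({
    'id':          [10, 10, 11, 11, 11],
    'timestamp':   [0, 1, 1, 2, 3],
    'derived_0':   [0.1, 0.2, 0.3, 0.4, 0.5],
    'technical_0': [1.0, 2.0, 3.0, 4.0, 5.0],
})
IDs = list(df['id'].unique())

# Start from all zeros, so missing (id, timestamp) pairs are already padded.
inputMatrix = np.zeros((len(IDs), num_timestamps, len(features)))

# Map each row's id to its position along axis 0; the timestamp is the position along axis 1.
id_pos = pd.Series(np.arange(len(IDs)), index=IDs)
rows = df['id'].map(id_pos).to_numpy()
cols = df['timestamp'].to_numpy()

# One vectorised assignment fills in every row that exists.
inputMatrix[rows, cols] = df[features].to_numpy()

print(inputMatrix.shape)  # (2, 4, 2)
```

With the real data, the same assignment would produce an array of shape (numIDs, 1813, numFeatures), with axis 0 ordered like IDs.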

Padding one dimension of an ndarray with 0s
