Question

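I have a DataFrame with three columns:
1. ID (int64): the object ID
2. DATETIME (datetime64[ns], named COLLECTION_DATETIME in the sample below): the date and time at which the past four values of an object were collected. Collections can be less or more than an hour apart. When two consecutive collections are 1 hour 15 minutes or more apart, the values for some 15-minute intervals may be missing
3. VALUES (string object): the four values of an object, comma-separated. Each value is the object's value over a past 15-minute interval. For example, values collected at 10 AM are "0,1,2,3", which means the object's value was 0 between 9:45 and 10 AM, 1 between 9:30 and 9:45 AM, and so on.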
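
To make the convention in item 3 concrete, here is a small sketch that expands one collection into its four timestamps. The helper name `expand_collection` is made up for this illustration, and it assumes each value is labeled with the end of its 15-minute interval, which is how the expected result further down is labeled.

```python
import pandas as pd

def expand_collection(collected_at, values):
    """Hypothetical helper: pair each value with the end of its 15-minute interval."""
    ts = pd.Timestamp(collected_at)
    parts = values.split(',')
    # the first value is the most recent one, so step back 15 minutes per position
    return [(ts - pd.Timedelta(minutes=15 * i), v) for i, v in enumerate(parts)]

for when, value in expand_collection('2017-09-13 10:00:00', '0,1,2,3'):
    print(when, value)
```

The values stay strings here; converting them to numbers is left out on purpose.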

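I want to resample this DataFrame at a 15-minute frequency, with one corresponding value per 15-minute interval, without any explicit for loop (or with as few loops as possible), because the DataFrame is huge and looping would make the execution time too long.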

Here is a sample for a single object:

ID,COLLECTION_DATETIME,VALUES
10000,2017-09-13 10:30:00,"2,1,0,3"
10000,2017-09-13 11:00:00,"6,5,2,1"
10000,2017-09-13 12:15:00,"0,0,0,2"
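
To reproduce the example, the sample above can be loaded into a DataFrame roughly like this. The names `csv_text` and `df` are arbitrary choices, and `parse_dates` is passed so that the datetime column is parsed as datetimes rather than left as strings.

```python
import io
import pandas as pd

csv_text = '''ID,COLLECTION_DATETIME,VALUES
10000,2017-09-13 10:30:00,"2,1,0,3"
10000,2017-09-13 11:00:00,"6,5,2,1"
10000,2017-09-13 12:15:00,"0,0,0,2"
'''

# the quoted VALUES field keeps its commas and is read as a single string
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['COLLECTION_DATETIME'])
print(df.dtypes)
```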

Here is the result I would like to get:

ID,COLLECTION_DATETIME,VALUE
10000,2017-09-13 09:45:00,3
10000,2017-09-13 10:00:00,0
10000,2017-09-13 10:15:00,1
10000,2017-09-13 10:30:00,2
10000,2017-09-13 10:45:00,5
10000,2017-09-13 11:00:00,6
10000,2017-09-13 11:15:00,NaN
10000,2017-09-13 11:30:00,2
10000,2017-09-13 11:45:00,0
10000,2017-09-13 12:00:00,0
10000,2017-09-13 12:15:00,0

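I guess this could be done by using the 'COLLECTION_DATETIME' column as the index and resampling at a 15-minute frequency, splitting the 'VALUES' column (df['VALUES'].str.split(',', expand=True)) and transposing it, then somehow assigning the result to a new column of df.resample('15min') and dropping the duplicate intervals, but I haven't managed to do it. Any ideas or pointers would be helpful.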
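
For reference, the split-and-expand idea could look roughly like the sketch below. It is only one possible reading of that idea: it uses `melt` instead of a transpose, the names `wide`, `long` and `lag` are invented here, and it assumes a single ID and timestamps that already sit on a 15-minute grid.

```python
import io
import pandas as pd

csv_text = '''ID,COLLECTION_DATETIME,VALUES
10000,2017-09-13 10:30:00,"2,1,0,3"
10000,2017-09-13 11:00:00,"6,5,2,1"
10000,2017-09-13 12:15:00,"0,0,0,2"
'''
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['COLLECTION_DATETIME'])

# one column per position: column 0 is the most recent value, column 3 the oldest
wide = df[['ID', 'COLLECTION_DATETIME']].join(df['VALUES'].str.split(',', expand=True))
# wide -> long: one row per (collection, position)
long = wide.melt(id_vars=['ID', 'COLLECTION_DATETIME'], var_name='lag', value_name='VALUE')
# shift each value back by 15 minutes per position
long['COLLECTION_DATETIME'] -= pd.to_timedelta(long['lag'].astype(int) * 15, unit='min')

result = (long.drop(columns='lag')
              .drop_duplicates('COLLECTION_DATETIME')  # overlapping collections
              .set_index('COLLECTION_DATETIME')
              .sort_index()
              .asfreq('15min'))                        # gaps become NaN rows
print(result)
```

With several IDs the deduplication and `asfreq` steps would have to run per ID, and the ID column is NaN in the gap rows until it is filled.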

Answer 1

You can use:

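import numpy as np
import pandas as pd

# split VALUES and reverse each list so the oldest value comes first
# (reversing the list rather than the string keeps multi-digit values intact)
a = df['VALUES'].str.split(',').str[::-1]
# number of values per row, used to repeat each row
l = a.str.len()
# repeat the rows and flatten the lists into the VALUES column
df = df.loc[df.index.repeat(l)].assign(VALUES=np.concatenate(a.tolist()))
# move the old index into an 'index' column and create a unique index
df = df.reset_index()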

print (df)
    index     ID COLLECTION_DATETIME VALUES
0       0  10000 2017-09-13 10:30:00      3
1       0  10000 2017-09-13 10:30:00      0
2       0  10000 2017-09-13 10:30:00      1
3       0  10000 2017-09-13 10:30:00      2
4       1  10000 2017-09-13 11:00:00      1
5       1  10000 2017-09-13 11:00:00      2
6       1  10000 2017-09-13 11:00:00      5
7       1  10000 2017-09-13 11:00:00      6
8       2  10000 2017-09-13 12:15:00      2
9       2  10000 2017-09-13 12:15:00      0
10      2  10000 2017-09-13 12:15:00      0
11      2  10000 2017-09-13 12:15:00      0

# within each original row, count back from the most recent value
# and subtract 15 minutes per step from the collection time
a = pd.to_timedelta(df[::-1].groupby('index').cumcount() * 15, unit='min')
df['COLLECTION_DATETIME'] = df['COLLECTION_DATETIME'] - a
df = df.set_index('COLLECTION_DATETIME').drop('index', axis=1)
# keep one row per timestamp, then reindex to a regular 15-minute grid
df = df.groupby(level=0).first().asfreq('15min')
# rows added by asfreq have no ID, so forward fill it
df['ID'] = df['ID'].ffill().astype(int)
print (df)
                        ID VALUES
COLLECTION_DATETIME              
2017-09-13 09:45:00  10000      3
2017-09-13 10:00:00  10000      0
2017-09-13 10:15:00  10000      1
2017-09-13 10:30:00  10000      2
2017-09-13 10:45:00  10000      5
2017-09-13 11:00:00  10000      6
2017-09-13 11:15:00  10000    NaN
2017-09-13 11:30:00  10000      2
2017-09-13 11:45:00  10000      0
2017-09-13 12:00:00  10000      0
2017-09-13 12:15:00  10000      0
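
The steps above handle one ID at a time, because `groupby(level=0)` groups on the timestamp only. As a hedged sketch of how the same idea might extend to several IDs, the function below uses `explode` (pandas 0.25 or later) and a per-ID `resample`. The function name `resample_15min`, the numeric conversion, and the second ID in the sample data are assumptions made for this illustration, not part of the question.

```python
import io
import pandas as pd

def resample_15min(df):
    """Sketch: spread comma-separated VALUES over a 15-minute grid, per ID."""
    out = df.reset_index(drop=True)               # unique 0..n-1 index, also a copy
    out['VALUES'] = out['VALUES'].str.split(',')  # position 0 = most recent value
    out = out.explode('VALUES')                   # one row per value, index repeated
    # position of each value inside its original row (0, 1, 2, 3)
    lag = out.groupby(level=0).cumcount()
    out['COLLECTION_DATETIME'] = (
        out['COLLECTION_DATETIME'] - pd.to_timedelta(lag * 15, unit='min')
    )
    # assumption: the values are numeric
    out['VALUE'] = pd.to_numeric(out['VALUES'])
    # per ID: first value in every 15-minute bin, NaN where a bin is empty
    return (
        out.set_index('COLLECTION_DATETIME')
           .sort_index(kind='stable')
           .groupby('ID')['VALUE']
           .resample('15min')
           .first()
           .reset_index()
    )

# sample from the question plus one made-up row for a second ID
csv_text = '''ID,COLLECTION_DATETIME,VALUES
10000,2017-09-13 10:30:00,"2,1,0,3"
10000,2017-09-13 11:00:00,"6,5,2,1"
10000,2017-09-13 12:15:00,"0,0,0,2"
10001,2017-09-13 10:00:00,"7,8,9,4"
'''
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['COLLECTION_DATETIME'])
result = resample_15min(df)
print(result)
```

When two collections overlap, this keeps the first value found for an interval and assumes the overlapping values agree, as they do in the sample.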

pandas - How to upsample and select the corresponding values for the new cells
