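I grouped the data frame test_df2 by frequency 'B' (business days, so each group's name is that day's date at 00:00), and I am now iterating over the groups to compute timestamp differences and store them in the dict grouped_bins. The original data frame and the data in the groups look like this: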
index | timestamp | status | externalId |
---|---|---|---|
0 | 2020-05-11 13:06:05.922 | 1 | 1 |
7 | 2020-05-11 13:14:29.759 | 10 | 1 |
8 | 2020-05-11 13:16:09.147 | 1 | 2 |
16 | 2020-05-11 13:19:08.641 | 10 | 2 |
What I want is to compute the difference between the timestamp values of rows that share the same externalId, for example rows 7 and 0.
This is what I have done so far.
# Group function. Dataframes are saved in a dict.
def groupDataWithFrequency(self, dataFrameLabel: str, groupKey: str, frequency: str):
    '''Group time series by frequency. Starts at the beginning of the data frame.'''
    print(f"Binning {dataFrameLabel} data with frequency {frequency}")
    if (isinstance(groupKey, str)):
        return self.dataDict[dataFrameLabel].groupby(pd.Grouper(key=groupKey, freq=frequency, origin="start"))

grouped_jobstates = groupDataWithFrequency("jobStatus", "timestamp", frequency)
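After grouping, I iterate over each group (to keep the per-day split) and try to compute the differences between the timestamps, and this is where things go wrong.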
grouped_bins = {}

def jobStatusPRAggregator(data, name):
    if (data["status"] == 1):
        # Find corresponding element in original dataframe
        correspondingStatus = test_df2.loc[(test_df2["externalId"] == data["externalId"]) & (test_df2["timestamp"] != data["timestamp"])]
        # Calculate time difference
        time = correspondingStatus["timestamp"] - data["timestamp"]
        # some prints:
        print(type(time))
        # <class 'pandas.core.series.Series'> --> Why is this a series?
        print(time.array[0])
        # 0 days 00:08:23.837000 --> This looks correct, I think?
        print(time)
        # 7 0 days 00:08:23.837000
        # Name: timestamp, dtype: timedelta64[ns]
        # Check if element exists in dict
        elem = next((x for x in grouped_bins if ((x["startDate"] == name) and ("productiveTime" in x))), None)
        # If does not exist yet, add to dict
        if elem is None:
            grouped_bins.append( {"startDate": name, "productiveTime": time })
        else:
            elem["productiveTime"] = elem["productiveTime"]
            # See below for problem

# Loop over groups
for name, group in grouped_jobstates:
    group.apply(jobStatusPRAggregator, args=(name,), axis=1)
The problem I am facing is the following. The element (elem) in the dict ends up looking like this:
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 0 NaT
7 NaT
8 NaT
16 NaT
17 NaT
..
1090 NaT
1091 NaT
1099 NaT
1100 NaT
1107 NaT
Name: timestamp, Length: 254, dtype: timedelta64[ns]}
What I want is something like this:
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 2 Days 12 hours 39 minutes 29 seconds
Name: timestamp, Length: 254, dtype: timedelta64[ns]}
That said, I am open to suggestions on how to store durations in Python/Pandas.
I am also open to suggestions about the loop itself.
Answer 0 (score: 1)
To convert the timestamp strings to datetime objects:
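from datetime import datetime
# datetime_format is not defined in the question; it must match the string layout,
# e.g. "%Y-%m-%d %H:%M:%S.%f" (an assumption based on the sample data).
# Calculate time difference (strptime takes a single string, so pick one element of the Series)
time1 = datetime.strptime(correspondingStatus["timestamp"].iloc[0], datetime_format)
time2 = datetime.strptime(data["timestamp"], datetime_format)
time = time1 - time2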
Then continue with the rest of your code from above.
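A side note on the NaT values shown in the question: pandas aligns Series on their index labels before doing arithmetic, so adding two timedelta Series whose labels do not overlap gives NaT for every label. The sketch below only illustrates that alignment behaviour; the names first, second, aligned_sum and scalar_sum are made up for the example, the values are the two differences from the sample data, and it is not a reconstruction of the exact code path in the question.

```python
import pandas as pd

# Two one-element timedelta Series with different index labels, similar in shape
# to what `correspondingStatus["timestamp"] - data["timestamp"]` returns.
first = pd.Series([pd.Timedelta("0 days 00:08:23.837")], index=[7])
second = pd.Series([pd.Timedelta("0 days 00:02:59.494")], index=[16])

# Series + Series aligns on index labels. Labels 7 and 16 do not match,
# so every entry of the result is NaT.
aligned_sum = first + second
print(aligned_sum)

# Pulling out the scalar Timedelta first avoids index alignment entirely.
scalar_sum = first.iloc[0] + second.iloc[0]
print(scalar_sum)
```

If a running total is kept per day, accumulating scalar Timedelta values (for example via .iloc[0]) rather than whole Series is one way to sidestep this.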
Answer 1 (score: 1)
To get the timestamp differences between consecutive rows of the same externalId, you should be able to simply write, for example:
df2 = df.assign(delta=df.groupby('externalId')['timestamp'].diff())
On the example you gave:
>>> df2
                 timestamp  status  externalId                  delta
0  2020-05-11 13:06:05.922       1           1                    NaT
7  2020-05-11 13:14:29.759      10           1 0 days 00:08:23.837000
8  2020-05-11 13:16:09.147       1           2                    NaT
16 2020-05-11 13:19:08.641      10           2 0 days 00:02:59.494000
If your timestamps are not actually of type Timestamp yet, you can convert them first:
df['timestamp'] = pd.to_datetime(df['timestamp'])
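To get from the per-row differences to one total per day, as the question asks, the deltas can be summed per day instead of looping over groups. The following is a minimal sketch under a few assumptions that are not in the original post: the sample rows are rebuilt by hand, "productive time" is taken to be the difference between consecutive rows of the same externalId, each difference is counted on the day of the later row, and days are calendar days via dt.normalize() rather than the 'B' business-day bins. The names per_day and seconds are made up for the example.

```python
import pandas as pd

# Sample rows from the question, rebuilt by hand for a self-contained example.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2020-05-11 13:06:05.922",
                "2020-05-11 13:14:29.759",
                "2020-05-11 13:16:09.147",
                "2020-05-11 13:19:08.641",
            ]
        ),
        "status": [1, 10, 1, 10],
        "externalId": [1, 1, 2, 2],
    },
    index=[0, 7, 8, 16],
)

# Difference to the previous row of the same externalId (NaT for the first row).
df["delta"] = df.groupby("externalId")["timestamp"].diff()

# Sum the differences per calendar day; NaT entries are skipped by sum().
per_day = df.groupby(df["timestamp"].dt.normalize())["delta"].sum()

# One record per day, with a scalar Timedelta instead of a whole Series.
grouped_bins = [
    {"startDate": day, "productiveTime": total} for day, total in per_day.items()
]
print(grouped_bins)

# If a plain number is easier to store, a Timedelta converts to seconds.
seconds = per_day.dt.total_seconds()
print(seconds)
```

Storing the duration as a pandas Timedelta keeps arithmetic with timestamps simple, while total_seconds() gives a float that is easy to serialise.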