Pandas: Subtracting timestamps

Date: 2021-06-14 21:45:33

Tags: python pandas

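I grouped the dataframe test_df2 by frequency 'B' (business days, so each group's name is that day's date at 00:00), and I am now iterating over the groups to compute timestamp differences and store them in the dict grouped_bins. The data in the original dataframe and in the groups looks like this: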

                 timestamp  status  externalId
0  2020-05-11 13:06:05.922       1           1
7  2020-05-11 13:14:29.759      10           1
8  2020-05-11 13:16:09.147       1           2
16 2020-05-11 13:19:08.641      10           2

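What I want is to compute the difference between the timestamps of matching rows, e.g. rows 7 and 0, because they have the same externalId.

This is what I have done so far.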

# Group function. Dataframes are saved in a dict.
def groupDataWithFrequency(self, dataFrameLabel: str, groupKey: str, frequency: str):
    '''Group time series by frequency. Starts at the beginning of the data frame.'''
    print(f"Binning {dataFrameLabel} data with frequency {frequency}")
    if (isinstance(groupKey, str)):
        return self.dataDict[dataFrameLabel].groupby(pd.Grouper(key=groupKey, freq=frequency, origin="start"))

grouped_jobstates = groupDataWithFrequency("jobStatus", "timestamp", frequency)

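After grouping, I iterate over each group (so that I stay within a single day) and try to compute the differences between the timestamps, and this is where it goes wrong.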

grouped_bins = {}

def jobStatusPRAggregator(data, name):
    if (data["status"] == 1):

        # Find corresponding element in original dataframe
        correspondingStatus = test_df2.loc[(test_df2["externalId"] == data["externalId"]) & (test_df2["timestamp"] != data["timestamp"])]

        # Calculate time difference
        time = correspondingStatus["timestamp"] - data["timestamp"]

        # some prints:
        print(type(time))
        # <class 'pandas.core.series.Series'> --> Why is this a series?

        print(time.array[0])
        # 0 days 00:08:23.837000 --> This looks correct, I think?

        print(time)
        # 7   0 days 00:08:23.837000
        # Name: timestamp, dtype: timedelta64[ns]
    
        # Check if element exists in dict
        elem = next((x for x in grouped_bins if ((x["startDate"] == name) and ("productiveTime" in x))), None)
      
        # If does not exist yet, add to dict
        if elem is None:
            grouped_bins.append( {"startDate": name, "productiveTime": time })
        else:
            elem["productiveTime"] = elem["productiveTime"]
            # See below for problem

# Loop over groups
for name, group in grouped_jobstates:
    group.apply(jobStatusPRAggregator, args=(name,), axis=1)

The problem I am facing is the following. The element in the dict (elem) ends up looking like this:

{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 0      NaT
7      NaT
8      NaT
16     NaT
17     NaT
        ..
1090   NaT
1091   NaT
1099   NaT
1100   NaT
1107   NaT
Name: timestamp, Length: 254, dtype: timedelta64[ns]}

What I want is something like this:

{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 2 Days 12 hours 39 minutes 29 seconds
Name: timestamp, Length: 254, dtype: timedelta64[ns]}

That said, I am open to suggestions on how to store durations in Python/Pandas.

I am also open to suggestions regarding the loop itself.

2 Answers:

Answer 0: (score: 1)

To convert the timestamp strings into datetime objects:

# Calculate time difference
time1 = datetime.strptime(correspondingStatus["timestamp"], datetime_format)
time2 = datetime.strptime(data["timestamp"], datetime_format)
time = time1 - time2

Then continue with the rest of your code above.
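
One caveat, going by the printed output in the question: `correspondingStatus["timestamp"]` appears to be a pandas Series rather than a single string, so subtracting a scalar from it yields another Series. The sketch below uses the four sample rows from the question to show the difference between keeping the Series and pulling out a scalar. The helper name `elapsed` is made up for illustration. It also shows that adding two such Series with different index labels produces NaT, which is one plausible (not confirmed) explanation for the NaT-filled result in the question.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "timestamp": [
            "2020-05-11 13:06:05.922",
            "2020-05-11 13:14:29.759",
            "2020-05-11 13:16:09.147",
            "2020-05-11 13:19:08.641",
        ],
        "status": [1, 10, 1, 10],
        "externalId": [1, 1, 2, 2],
    },
    index=[0, 7, 8, 16],
)
df["timestamp"] = pd.to_datetime(df["timestamp"])


def elapsed(row):
    """Hypothetical helper: time from `row` to the other rows with the same externalId."""
    partner = df.loc[
        (df["externalId"] == row["externalId"])
        & (df["timestamp"] != row["timestamp"]),
        "timestamp",
    ]
    # Series minus scalar Timestamp gives a Series of timedeltas.
    return partner - row["timestamp"]


first = elapsed(df.loc[0])   # Series with index label 7
second = elapsed(df.loc[8])  # Series with index label 16

# Adding Series aligns on index labels; labels 7 and 16 do not match,
# so every entry of the result is NaT.
misaligned = first + second

# Extracting scalars first gives a single Timedelta instead.
total = first.iloc[0] + second.iloc[0]
print(total)
```

Accumulating scalar `Timedelta` values (via `.iloc[0]` here) sidesteps the index alignment entirely.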

Answer 1: (score: 1)

To get the timestamp differences between consecutive rows with the same externalId, you should be able to simply write, for example:

df2 = df.assign(delta=df.groupby('externalId')['timestamp'].diff())

On the example you gave:

>>> df2
                 timestamp  status  externalId                  delta
0  2020-05-11 13:06:05.922       1           1                    NaT
7  2020-05-11 13:14:29.759      10           1 0 days 00:08:23.837000
8  2020-05-11 13:16:09.147       1           2                    NaT
16 2020-05-11 13:19:08.641      10           2 0 days 00:02:59.494000

If your timestamps are not actually of type Timestamp yet, you can convert them first:

df['timestamp'] = pd.to_datetime(df['timestamp'])
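
To get from the per-row deltas to one total per business day, as asked in the question, the deltas can be summed with the same kind of `pd.Grouper` used there. This is a minimal sketch on the four sample rows. It assumes each externalId has exactly one start row and one end row, and it attributes each delta to the day of the later row. The list-of-dicts layout for `grouped_bins` is only one possible way to store the result.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "timestamp": [
            "2020-05-11 13:06:05.922",
            "2020-05-11 13:14:29.759",
            "2020-05-11 13:16:09.147",
            "2020-05-11 13:19:08.641",
        ],
        "status": [1, 10, 1, 10],
        "externalId": [1, 1, 2, 2],
    },
    index=[0, 7, 8, 16],
)
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Difference between consecutive rows of the same externalId.
df2 = df.assign(delta=df.groupby("externalId")["timestamp"].diff())

# Sum the deltas per business day; NaT entries are skipped by sum().
per_day = df2.groupby(pd.Grouper(key="timestamp", freq="B"))["delta"].sum()

# One record per day, each holding a single scalar Timedelta.
grouped_bins = [
    {"startDate": day, "productiveTime": total}
    for day, total in per_day.items()
]
print(grouped_bins)
```

Each `productiveTime` is then a scalar `pd.Timedelta`. If a plain number is easier to store, `per_day.dt.total_seconds()` converts the totals to seconds as floats. If an externalId can go through more than two status changes, the `delta` column would also contain the gaps between unrelated rows, so it would need filtering (for example on `status`) before summing.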