Question

I'm struggling with the following problem. I have a dataset containing hardware_serial values and timestamps.

 hardware_serial  ...    Timestamp           
0    00053        ...  2020-10-26 16:04:41
1    77684             2020-10-26 16:02:23
2    00053             2020-10-26 15:59:41
3    77684             2020-10-26 15:57:23

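The goal is to filter out rows whose timestamp is 4-7 minutes later than the first row for the same hardware_serial. There are more rows for that hardware_serial, but only hours later; I only want to compare two messages that fall within 4-7 minutes of each other, to filter out the duplicates.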

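import pandas as pd
from datetime import timedelta

rawdata['Timestamp'] = pd.to_datetime(rawdata['date'] + ' ' + rawdata['time'])
a = rawdata['Timestamp']  # assumed: `a` is the Timestamp column
newdf = rawdata
# Compare every pair of rows (O(n^2)); assumes the default 0..n-1 index
for x in range(len(rawdata.index)):
    for y in range(len(rawdata.index)):
        if rawdata['hardware_serial'].iloc[x] == rawdata['hardware_serial'].iloc[y]:
            if y != x:
                if a.iloc[x] - a.iloc[y] > timedelta(minutes=0):
                    if a.iloc[x] - a.iloc[y] <= timedelta(minutes=7):
                        # Row x follows row y within 7 minutes: drop x
                        if x in newdf.index:
                            newdf = newdf.drop(x)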

This loop works, but with only 600 rows it already takes 15 seconds. You can imagine what 10E3 rows would take.

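The thing is, duplicate rows only need to be checked against the next few rows, for example at most 20 rows below. After 20 rows, enough time has passed that the gap is no longer around 5 minutes.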

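So I thought of writing a nested loop that compares the value of rawdata['Timestamp'].iloc[x] with the values in rows 1, 2, 3, ..., 20 below it, using an inner for loop with parameter 'y'. I find it hard to explain this properly, so my idea is shown below.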

newdf = rawdata
n = len(rawdata.index)
for x in range(n):
    # Look at up to 20 rows below x, without running past the last row
    for y in range(1, min(20, n - 1 - x) + 1):
        if rawdata['hardware_serial'].iloc[x] == rawdata['hardware_serial'].iloc[x+y]:
            if a.iloc[x] - a.iloc[x+y] > timedelta(minutes=0):
                if a.iloc[x] - a.iloc[x+y] <= timedelta(minutes=7):
                    # Note: this drops the earlier row (x+y); the first loop drops the later row (x)
                    if x+y in newdf.index:
                        print(x + y)
                        print(a.iloc[x] - a.iloc[x + y])
                        newdf = newdf.drop(x+y)

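However, when I run this nested for loop, the size of the resulting dataframe changes every time I change the range of y. So it does not work correctly. Unfortunately, I don't know how to fix it. Any help would be much appreciated!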
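One possible direction, sketched here only under assumptions: the pairwise comparison might be replaced by sorting on serial and timestamp and taking the gap to the previous message of the same serial with groupby and diff. The helper name drop_repeats, its parameters, and the sample rows are hypothetical and not part of the original code. It mirrors the first loop (dropping the later row of a close pair) as long as timestamps are distinct within a serial and the index is unique.

```python
import pandas as pd

def drop_repeats(df, serial_col="hardware_serial", time_col="Timestamp",
                 min_gap=pd.Timedelta(0), max_gap=pd.Timedelta(minutes=7)):
    """Hypothetical helper: drop rows that follow the previous row of the
    same serial by more than min_gap and at most max_gap."""
    # Sort so each row's predecessor within its serial is the next-earlier message
    ordered = df.sort_values([serial_col, time_col])
    # Gap to the previous message of the same serial (NaT for the first one)
    gap = ordered.groupby(serial_col)[time_col].diff()
    # Comparisons against NaT are False, so first messages are never flagged
    is_repeat = (gap > min_gap) & (gap <= max_gap)
    # Realign the mask to the original row order (assumes a unique index)
    return df.loc[~is_repeat.reindex(df.index)]

# Made-up sample resembling the data above, plus one message hours later
rawdata = pd.DataFrame({
    "hardware_serial": ["00053", "77684", "00053", "77684", "00053"],
    "Timestamp": pd.to_datetime([
        "2020-10-26 16:04:41",
        "2020-10-26 16:02:23",
        "2020-10-26 15:59:41",
        "2020-10-26 15:57:23",
        "2020-10-26 19:00:00",
    ]),
})
newdf = drop_repeats(rawdata)
print(newdf)
```

Because the work is one sort plus a grouped difference, no fixed look-ahead such as 20 rows is needed. Passing min_gap=pd.Timedelta(minutes=4) would restrict the filter to the 4-7 minute window described above.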

Reducing the run time of a for loop in Python pandas

0 Answers: