Pandas: use a loop to compare the datetime in each row with all rows and save the resulting subsets

Time: 2016-07-09 21:56:08

Tags: python datetime pandas for-loop filter

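This is my first time using Python (I have used R before), so please bear with me. Basically, I want to use a for loop to compare the datetime value in each row of a pandas DataFrame with the datetime values in all the other rows and, whenever the time difference is 4 hours or less, store those rows in a subset object df for later processing. However, I'm not sure where to start.

Let's assume this is my dataset: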

              Origin           Destination                Time
0           New York                 Cairo 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston                Hawaii 2016-03-28 06:00:00
3           New York                Boston 2016-03-28 08:00:00
4        Los Angeles                Boston 2016-03-28 10:00:00
5        Los Angeles                Hawaii 2016-03-28 12:00:00
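
For anyone who wants to reproduce this, the table above can be rebuilt as a DataFrame roughly as follows. This is a minimal sketch: the variable name `flights` is an assumption and does not appear in the original post.

```python
import pandas as pd

# Sample data copied from the table above; `flights` is an assumed name.
flights = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    # Parse the strings so that Time is a real datetime column, not text.
    "Time": pd.to_datetime([
        "2016-03-28 02:00:00", "2016-03-28 04:00:00", "2016-03-28 06:00:00",
        "2016-03-28 08:00:00", "2016-03-28 10:00:00", "2016-03-28 12:00:00",
    ]),
})

print(flights)
print(flights.dtypes)
```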

This is what the result should look like:

>>>df[0]
              Origin           Destination                Time
0           New York                 Cairo 2016-03-28 02:00:00
>>>df[1]
              Origin           Destination                Time
0           New York                 Cairo 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
>>>df[2]
              Origin           Destination                Time
0           New York                 Cairo 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston                Hawaii 2016-03-28 06:00:00
>>>df[3]
              Origin           Destination                Time
1           New York           Los Angeles 2016-03-28 04:00:00
2             Boston                Hawaii 2016-03-28 06:00:00
3           New York                Boston 2016-03-28 08:00:00
>>>df[5]
              Origin           Destination                Time
3           New York                Boston 2016-03-28 08:00:00
4        Los Angeles                Boston 2016-03-28 10:00:00
5        Los Angeles                Hawaii 2016-03-28 12:00:00

I can't figure out how to do this.

3 answers:

Answer 0 (score: 4)

If you want a pure pandas solution without any loops, you can do the following:

  1. Cross-join the data with itself.
  2. Keep the row pairs whose time difference is less than 4 hours.
  3. Group the data.

Here is an example:

    import pandas as pd

    # Load file
    data = pd.read_csv("abc.csv", delimiter="\t")
    data["Time"] = pd.to_datetime(data["Time"])
    data["Ignore"] = 1  # constant key, so the merge below pairs every row with every row
    data = data.reset_index()
    
    # cross-join
    merged = pd.merge(data, data, how="outer", on="Ignore")
    
    # this is the magic: keep pairs less than 4 hours apart
    # (the strict "<" drops pairs that are exactly 4 hours apart)
    merged = merged[(merged["Time_x"] - merged["Time_y"]).abs() < pd.Timedelta("4 hours")]
    
    # so you have some structure: one block per left-hand row, indexed by its matches
    groups = merged.groupby("index_x").apply(lambda x : x.set_index("index_y")[["Origin_y", "Destination_y", "Time_y"]])
    

This gives you a result like this:

            Origin_y    Destination_y   Time_y
    index_x index_y         
    0   0   New York    Cairo   2016-03-28 02:00:00
        1   New York    Los Angeles 2016-03-28 04:00:00
    1   0   New York    Cairo   2016-03-28 02:00:00
        1   New York    Los Angeles 2016-03-28 04:00:00
        2   Boston  Hawaii  2016-03-28 06:00:00
    2   1   New York    Los Angeles 2016-03-28 04:00:00
        2   Boston  Hawaii  2016-03-28 06:00:00
        3   New York    Boston  2016-03-28 08:00:00
    3   2   Boston  Hawaii  2016-03-28 06:00:00
        3   New York    Boston  2016-03-28 08:00:00
    ...
    

You can access the individual groups like this:

    > groups.T[0].T
    
    Origin_y    Destination_y   Time_y
    index_y         
    0   New York    Cairo   2016-03-28 02:00:00
    1   New York    Los Angeles 2016-03-28 04:00:00
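
On pandas 1.2 or later the dummy `Ignore` key should not be needed, because `merge` accepts `how="cross"`. The following is a minimal sketch of the same idea, not the answerer's code: it rebuilds the question's sample data inline (the name `flights` is an assumption) and collects the groups into a plain dict called `subsets` (also an assumed name).

```python
import pandas as pd

# Sample data from the question; `flights` is an assumed name.
flights = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    "Time": pd.to_datetime([
        "2016-03-28 02:00:00", "2016-03-28 04:00:00", "2016-03-28 06:00:00",
        "2016-03-28 08:00:00", "2016-03-28 10:00:00", "2016-03-28 12:00:00",
    ]),
})

# Keep the original row label as a column so it survives the merge.
data = flights.reset_index()

# how="cross" (pandas >= 1.2) pairs every row with every row.
pairs = data.merge(data, how="cross", suffixes=("_x", "_y"))

# Strict "<" as in the answer: pairs exactly 4 hours apart are dropped.
close = pairs[(pairs["Time_x"] - pairs["Time_y"]).abs() < pd.Timedelta(hours=4)]

# One sub-frame per left-hand row, keyed by that row's original index.
subsets = {
    key: grp.set_index("index_y")[["Origin_y", "Destination_y", "Time_y"]]
    for key, grp in close.groupby("index_x")
}

print(subsets[0])
```

A dict keyed by the original index makes `subsets[0]` read like the `df[0]` in the question. Changing `<` to `<=` would also keep rows that are exactly 4 hours apart.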
    

Answer 1 (score: 2)

Starting with this:

                Origin              Destination                  Time
0             New York                    Cairo   2016-03-28 00:00:00
1             New York               Los Angeles  2016-03-28 02:00:00
2               Boston                    Hawaii  2016-03-28 04:00:00
3             New York                   Boston   2016-03-28 06:00:00
4          Los Angeles                   Boston   2016-03-28 08:00:00
5          Los Angeles                  Hawaii    2016-03-28 10:00:00

Use a dict to store your DataFrames, keyed by the row index, then use that index to access each DataFrame in the dict.

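import pandas as pd
NewDict = {}  # one sub-DataFrame per row, keyed by the row's index label
for i, e in df.iterrows():
    # keep the rows strictly within 4 hours before or after this row's time
    NewDict[i] = df[ (df['Time'] > e['Time']-pd.Timedelta('4 hours')) & (df['Time'] < e['Time'] + pd.Timedelta('4 hours'))]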

NewDict[0]

                Origin             Destination                  Time
0             New York                   Cairo   2016-03-28 00:00:00
1             New York              Los Angeles  2016-03-28 02:00:00

NewDict[4]
                Origin              Destination                  Time
3             New York                   Boston   2016-03-28 06:00:00
4          Los Angeles                   Boston   2016-03-28 08:00:00
5          Los Angeles                  Hawaii    2016-03-28 10:00:00
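
Note that the window above is symmetric and strict, whereas the expected output in the question only looks back in time and includes the row exactly 4 hours earlier. A sketch of that variant, under the assumption that this is what the asker wants; it rebuilds this answer's `df` inline, and the dict name `lookback` is hypothetical:

```python
import pandas as pd

# Same data as in this answer (times start at 00:00).
df = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    "Time": pd.to_datetime([
        "2016-03-28 00:00:00", "2016-03-28 02:00:00", "2016-03-28 04:00:00",
        "2016-03-28 06:00:00", "2016-03-28 08:00:00", "2016-03-28 10:00:00",
    ]),
})

window = pd.Timedelta(hours=4)
lookback = {}  # hypothetical name: row label -> rows at most 4 hours earlier
for i, row in df.iterrows():
    # How far each row lies before the current one (negative means later).
    age = row["Time"] - df["Time"]
    # Inclusive on both ends: the row itself (age 0) up to exactly 4 hours back.
    lookback[i] = df[(age >= pd.Timedelta(0)) & (age <= window)]

print(lookback[3])
```

With this data, `lookback[3]` holds rows 1, 2 and 3, matching the shape of `df[3]` in the question.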

To get the counts:

for k, v in NewDict.items():
    print("Key", k, "has", len(v), "items")

Key 0 has 2 items
Key 1 has 3 items
Key 2 has 3 items
Key 3 has 3 items
Key 4 has 3 items
Key 5 has 2 items

Edit, to loop in reverse:

reverse = df.reindex(index=df.index[::-1]) 
revSorted = {} 
for i, e in reverse.iterrows(): 
    revSorted[i] = reverse[ (reverse['Time'] > e['Time']-pd.Timedelta('4 hours')) & (reverse['Time'] < e['Time'] + pd.Timedelta('4 hours'))]

Answer 2 (score: 1)

The logic of the loop is:

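from datetime import timedelta

# Assumption: rows is a list of dict-like records already sorted by 'Time'.
df = []
for i, row in enumerate(rows):
    df.append([row])
    # Slicing past the end just yields an empty list, so no IndexError handling is needed.
    for next_row in rows[i + 1:]:
        if abs(row['Time'] - next_row['Time']) < timedelta(hours=4):
            df[i].append(next_row)
        else:
            # Later rows are even further away in time, so stop early.
            break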
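
To try this logic on the question's data, the rows can be pulled out of a DataFrame first. This is only a sketch under stated assumptions: the names `flights`, `rows` and `groups` are made up for illustration, and `to_dict("records")` is one possible way to get dict-like rows. Unlike the expected output in the question, this loop only looks forward in time and uses a strict `<`.

```python
from datetime import timedelta

import pandas as pd

# Sample data from the question; `flights` is an assumed name.
flights = pd.DataFrame({
    "Origin": ["New York", "New York", "Boston",
               "New York", "Los Angeles", "Los Angeles"],
    "Destination": ["Cairo", "Los Angeles", "Hawaii",
                    "Boston", "Boston", "Hawaii"],
    "Time": pd.to_datetime([
        "2016-03-28 02:00:00", "2016-03-28 04:00:00", "2016-03-28 06:00:00",
        "2016-03-28 08:00:00", "2016-03-28 10:00:00", "2016-03-28 12:00:00",
    ]),
})

# The early break below is only valid if the rows are sorted by time.
rows = flights.sort_values("Time").to_dict("records")

groups = []
for i, row in enumerate(rows):
    group = [row]
    for next_row in rows[i + 1:]:
        # Forward-looking, strict comparison: exactly 4 hours apart is excluded.
        if next_row["Time"] - row["Time"] < timedelta(hours=4):
            group.append(next_row)
        else:
            break
    # Turn the list of records back into a small DataFrame.
    groups.append(pd.DataFrame(group))

print(groups[0])
```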