Identify groups of two rows that satisfy three conditions in a dataframe

Time: 2019-03-07 08:47:06

Tags: python-3.x pandas if-statement filter haversine

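I have the df below and want to identify any two orders that satisfy all of the following conditions:

  1. The distance between pickups is less than X miles
  2. The distance between dropoffs is less than Y miles
  3. The difference between order creation times is less than Z minutes

I would use `from haversine import haversine` to calculate the pickup distance and the dropoff distance between each pair of rows (orders).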
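For reference, the great-circle distance that the haversine package computes can be sketched with the standard library alone. This is a minimal illustration, not the package's own code: the function name `haversine_miles` and the Earth radius constant are assumptions chosen for this example, and points are taken as (latitude, longitude) in degrees.

```python
import math

EARTH_RADIUS_MI = 3958.8  # approximate mean Earth radius in miles (assumed constant)

def haversine_miles(p1, p2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # haversine formula: a is the squared half-chord length between the points
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

# Pickup points of orders 235d and 253w from the sample data
d = haversine_miles((40.73, -73.98), (40.76, -73.99))
print(round(d, 2))
```

Note that the order of the coordinates matters: passing (longitude, latitude) instead of (latitude, longitude) gives a different, incorrect distance.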

The df I currently have looks like this:

  DAY   Order  pickup_lat  pickup_long     dropoff_lat dropoff_long  created_time
 1/3/19  234e    32.69        -117.1          32.63      -117.08   3/1/19 19:00
 1/3/19  235d    40.73        -73.98          40.73       -73.99   3/1/19 23:21
 1/3/19  253w    40.76        -73.99          40.76       -73.99   3/1/19 15:26
 2/3/19  231y    36.08        -94.2           36.07       -94.21   3/2/19 0:14
 3/3/19  305g    36.01        -78.92          36.01       -78.95   3/2/19 0:09
 3/3/19  328s    36.76        -119.83         36.74       -119.79  3/2/19 4:33
 3/3/19  286n    35.76        -78.78          35.78       -78.74   3/2/19 0:43

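I would like the output df to contain any 2 orders (rows) that satisfy the conditions above. What I am unsure about is how to calculate this for every row in the dataframe so that it returns any two rows meeting those conditions.

I hope I have explained the desired output clearly. Thanks for looking!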

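2 Answers:

Answer 0: (score: 3)

I don't know whether this is the optimal solution, but I couldn't come up with anything else. What I did:

  • created a dataframe with all possible combinations of orders,
  • calculated all the required measures for every combination and added them to the dataframe as columns,
  • found the indexes of the rows that satisfy the conditions above.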
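As a small illustration of the first step, `itertools.combinations` yields each unordered pair of row labels exactly once. The list `row_index` below is a hypothetical stand-in for a dataframe's index, not part of the original answer.

```python
from itertools import combinations

# Hypothetical row labels standing in for a dataframe's index
row_index = [0, 1, 2, 3]

# Each unordered pair appears once: (0, 1) is produced, (1, 0) is not
pairs = list(combinations(row_index, 2))
print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(pairs))  # 6, i.e. n * (n - 1) / 2 for n = 4
```

For the 7 orders in the question this gives 21 pairs. The number of pairs grows quadratically with the number of rows, which is worth keeping in mind for large dataframes.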

Code:

import pandas as pd
from itertools import combinations
from haversine import haversine

# create a dataframe with all pairwise combinations of rows
index_comb = list(combinations(trips.index, 2))  # trips is your dataframe
orders1 = pd.DataFrame([trips.loc[c[0], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = pd.DataFrame([trips.loc[c[1], :].values for c in index_comb], columns=trips.columns, index=index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1, orders2], axis=1)

def distance(row):
    # row holds four values in the order lat_0, lon_0, lat_1, lon_1
    loc_0 = (row.iloc[0], row.iloc[1])  # (lat, lon)
    loc_1 = (row.iloc[2], row.iloc[3])
    return haversine(loc_0, loc_1, unit='mi')

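#pickup diff; columns must be listed as (lat, lon) pairs to match distance()
#note: the pickup_dist_mi values in the sample output were computed with
#longitude listed first, so they will not match this column order
pickup_cols = ["pickup_lat","pickup_long","pickup_lat_1","pickup_long_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)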

#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)

#creation time diff in minutes (total_seconds avoids astype('timedelta64[m]'), which newer pandas rejects)
combined["time_diff_min"] = (pd.to_datetime(combined["created_time"])-pd.to_datetime(combined["created_time_1"])).abs().dt.total_seconds() / 60

#Thresholds
Z = 600
Y = 400
X = 400

#find order pairs that satisfy all three conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X =  combined["pickup_dist_mi"]<X
dropoff_dist_Y =  combined["dropoff_dist_mi"]<Y
conditions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[conditions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]

Output:

        Order Order_1  time_diff_min  dropoff_dist_mi  pickup_dist_mi
(0, 5)  234e    328s          573.0       322.988195      231.300179
(1, 2)  235d    253w          475.0         2.072803        0.896893
(4, 6)  305g    286n           34.0        19.766096       10.233550

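I hope I understood you correctly and that this helps.

Answer 1: (score: 2)

Using the dataframe as described above, with the index dropped. I assume your created_time column is already in datetime format.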

import pandas as pd
from geopy.distance import geodesic

Cross-merge the dataframe with itself to get all possible combinations of 'Order'.

df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)

Drop all rows where both orders are the same.

df_all = df_all[~(df_all['Order_x'] == df_all['Order_y'])].copy()

Drop duplicate rows where Order_x, Order_y == [a, b] and [b, a].

# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, and combine into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))

# drop the duplicates; the original index is kept
# (use df_all = df_all.reset_index(drop=True) if you want the rows renumbered)
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
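As an alternative sketch (not the answer's own approach), the self-pair removal and the [a, b] / [b, a] deduplication can be collapsed into a single comparison, assuming the Order values are unique. It also uses `how="cross"`, which is available in pandas 1.2 and later, instead of a helper key column. The three-row frame here is a hypothetical subset of the question's data.

```python
import pandas as pd

# Hypothetical three-order frame taken from the question's sample data
df = pd.DataFrame({
    "Order": ["234e", "235d", "253w"],
    "pickup_lat": [32.69, 40.73, 40.76],
})

# how="cross" builds every ordered pair of rows without a helper key column
pairs = df.merge(df, how="cross", suffixes=("_x", "_y"))

# Keeping Order_x < Order_y drops self-pairs and one of each mirrored pair
pairs = pairs[pairs["Order_x"] < pairs["Order_y"]].reset_index(drop=True)
print(pairs[["Order_x", "Order_y"]])
```

This relies on Order being a unique identifier; if two rows shared the same Order value, their pairing would be dropped as a self-pair.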

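Create a column with the time difference in minutes.

# created_time must already be a datetime column (see the assumption above)
df_all['time'] = (df_all['created_time_x'] - df_all['created_time_y']).abs().dt.total_seconds() / 60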
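As a quick check of the time-difference column, the gap between orders 235d and 231y can be reproduced with the standard library. The helper name `minutes_between` and the format string are assumptions for this sketch; the format assumes the created_time strings are month/day/two-digit year followed by hour:minute.

```python
from datetime import datetime

# Assumed layout of the created_time strings: month/day/two-digit year hour:minute
FMT = "%m/%d/%y %H:%M"

def minutes_between(t1, t2):
    """Absolute gap between two timestamp strings, in minutes."""
    delta = datetime.strptime(t1, FMT) - datetime.strptime(t2, FMT)
    return abs(delta.total_seconds()) / 60

# created_time of orders 235d and 231y from the question
print(minutes_between("3/1/19 23:21", "3/2/19 0:14"))  # 53.0
```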

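Create a column with the distance between dropoff points.

# note: the dropoff values in the printed output at the end were produced with
# dropoff_lat_x used for both points; with dropoff_lat_y they will differ
df_all['dropoff'] = df_all.apply(
    (lambda row: geodesic(
        (row['dropoff_lat_x'], row['dropoff_long_x']),
        (row['dropoff_lat_y'], row['dropoff_long_y'])
    ).miles),
    axis=1
)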

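Create a column with the distance between pickup points.

# note: the pickup values in the printed output at the end were produced with
# pickup_lat_x used for both points; with pickup_lat_y they will differ
df_all['pickup'] = df_all.apply(
    (lambda row: geodesic(
        (row['pickup_lat_x'], row['pickup_long_x']),
        (row['pickup_lat_y'], row['pickup_long_y'])
    ).miles),
    axis=1
)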

Filter the results as needed.

X = 1500
Y = 2000
Z = 100

mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z

print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])

   Order_x Order_y  time      dropoff       pickup
10    235d    231y  53.0  1059.026620  1059.026620
11    235d    305g  48.0   260.325370   259.275948
13    235d    286n  82.0   249.306279   251.929905
25    231y    305g   5.0   853.308110   854.315567
27    231y    286n  29.0   865.026077   862.126593
34    305g    286n  34.0    11.763787     7.842526