Question

我试图弄清楚是否有一种方法可以使一个数据框具有多个字段，并且我想根据特定列的值是否在x的数量之内，将该数据框分段或分组为一个新的数据框。彼此吗？

   I.D  |      Created_Time            | Home_Longitude | Home_Latitude | Work_Longitude | Home_Latitude
  Faa1      2019-02-23 20:01:13.362           -77.0364            38.8951    -72.0364      38.8951

以上是原始df具有多行的外观。我想创建一个新的数据框，其中所有行或ID包含的创建时间都在x分钟之间，并且在另一个房屋x英里内使用Haversine，在另一个作品内x英里内使用Haversine。

因此，基本上，尝试将数据帧过滤到一个df中，该df仅包含订单创建时间的x分钟以内，另一个房屋内x英里，每个工作列值内x英里的行。

Answer 1

我这样做是

计算相对于第一行的距离（以英里为单位）和时间
- 我的逻辑
  - 如果n行位于第一行的 x分钟/英里之内，则这n行彼此之间在 x分钟/英里的内

使用所需的距离和时间过滤条件过滤数据

生成一些虚拟数据

random co-ordinates

# Generate random Lat-Long points def newpoint(): return uniform(-180,180), uniform(-90, 90) home_points = (newpoint() for x in range(289)) work_points = (newpoint() for x in range(289)) df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude']) df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points) # Insert `ID` column as sequence of integers df.insert(0, 'ID', range(289)) # Generate random datetimes, separated by 5 minute intervals # (you can choose your own interval) times = pd.date_range('2012-10-01', periods=289, freq='5min') df.insert(1, 'Created_Time', times) print(df.head()) ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude 0 0 2012-10-01 00:00:00 -48.885981 -39.412351 -68.756244 24.739860 1 1 2012-10-01 00:05:00 58.584893 59.851739 -119.978429 -87.687858 2 2 2012-10-01 00:10:00 -18.623484 85.435248 -14.204142 -3.693993 3 3 2012-10-01 00:15:00 -29.721788 71.671103 -69.833253 -12.446204 4 4 2012-10-01 00:20:00 168.257968 -13.247833 60.979050 -18.393925

使用Haversine距离公式（vectorized haversine distance formula, in km）创建Python帮助器函数

def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371): """ slightly modified version: of http://stackoverflow.com/a/29546836/2901002 Calculate the great circle distance between two points on the earth (specified in decimal degrees or in radians) All (lat, lon) coordinates must have numeric dtypes and be of equal length. """ if to_radians: lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2]) a = np.sin((lat2-lat1)/2.0)**2 + \ np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2 return earth_radius * 2 * np.arcsin(np.sqrt(a))

使用Haversine公式计算距离（相对于第一行），以km为单位。然后，将公里转换为英里

df['Home_dist_miles'] = \ haversine(df.Home_Longitude, df.Home_Latitude, df.loc[0, 'Home_Longitude'], df.loc[0, 'Home_Latitude'])*0.621371 df['Work_dist_miles'] = \ haversine(df.Work_Longitude, df.Work_Latitude, df.loc[0, 'Work_Longitude'], df.loc[0, 'Work_Latitude'])*0.621371

计算time differences, in minutes（相对于第一行）

对于此处的虚拟数据，时间差将为5分钟的倍数（但在真实数据中，它们可以是任意值）

df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time'] df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60

应用过滤器（方法1），然后选择满足OP中所述条件的任意2行

home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles time_filter = df['time_min']<=25 # within 25 minutes df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)] # Select any 2 rows that satisfy required conditions df_any2rows = df_filtered.sample(n=2) print(df_any2rows) ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min 0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0 4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0

应用过滤器（方法2），然后应用select any 2 rows满足OP中规定的条件

multi_query = """Home_dist_miles<=12000 & \ Work_dist_miles<=8000 & \ time_min<=25""" df_filtered = df.query(multi_query) # Select any 2 rows that satisfy required conditions df_any2rows = df_filtered.sample(n=2) print(df_any2rows) ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min 0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0 4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0

根据参数或数据框行下方的列中的差异对df进行细分或分组？

1 个答案: