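I'm trying to figure out whether there is a way to take a DataFrame with multiple fields and segment or group it into a new DataFrame, based on whether the values in specific columns are within some amount x of each other.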
I.D | Created_Time | Home_Longitude | Home_Latitude | Work_Longitude | Work_Latitude
Faa1 2019-02-23 20:01:13.362 -77.0364 38.8951 -72.0364 38.8951
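The above is what the original df looks like, with many rows. I want to create a new DataFrame containing all the rows (IDs) whose creation times are within x minutes of each other, whose home locations are within x miles of each other by Haversine distance, and whose work locations are within x miles of each other by Haversine distance.
So basically, I'm trying to filter the DataFrame down to a df that contains only the rows that are within x minutes of each other's order creation time, within x miles of each other's home, and within x miles of each other's work location.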
Answer 0 (score: 0)
Here is how I did it.
Generate some dummy data
import numpy as np
import pandas as pd
from random import uniform

# Generate random (longitude, latitude) points
def newpoint():
    return uniform(-180, 180), uniform(-90, 90)
home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))
df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)
# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))
# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude
0 0 2012-10-01 00:00:00 -48.885981 -39.412351 -68.756244 24.739860
1 1 2012-10-01 00:05:00 58.584893 59.851739 -119.978429 -87.687858
2 2 2012-10-01 00:10:00 -18.623484 85.435248 -14.204142 -3.693993
3 3 2012-10-01 00:15:00 -29.721788 71.671103 -69.833253 -12.446204
4 4 2012-10-01 00:20:00 168.257968 -13.247833 60.979050 -18.393925
Create a Python helper function using the Haversine distance formula (vectorized haversine distance formula, in km)
def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great-circle distance between two points
    on the earth (specified in decimal degrees or in radians).
    All (lat, lon) coordinates must have numeric dtypes; array-like
    arguments must be of equal length (scalars are broadcast).
    """
    if to_radians:
        # Convert each argument separately so a Series can be mixed with scalars
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Compute the distances with the Haversine formula (relative to the first row), in km, then convert kilometres to miles. The coordinates are stored in decimal degrees, so pass to_radians=True, and pass latitude before longitude to match the function signature.
df['Home_dist_miles'] = \
    haversine(df.Home_Latitude, df.Home_Longitude,
              df.loc[0, 'Home_Latitude'], df.loc[0, 'Home_Longitude'],
              to_radians=True) * 0.621371
df['Work_dist_miles'] = \
    haversine(df.Work_Latitude, df.Work_Longitude,
              df.loc[0, 'Work_Latitude'], df.loc[0, 'Work_Longitude'],
              to_radians=True) * 0.621371
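As a quick sanity check on a helper like this, its result can be compared with a distance that is known in closed form: a quarter of the way around the equator should equal earth_radius * pi / 2. The sketch below uses a standalone variant named haversine_km (a name chosen here for illustration, not part of the answer) that always takes decimal degrees, and it also shows a Series being compared against a single reference point.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; all inputs in decimal degrees."""
    # Convert each argument on its own so Series and scalars can be mixed
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# A quarter of the equator: expect 6371 * pi / 2
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
print(round(quarter, 1))  # 10007.5

# A Series of points against one reference point (the scalars broadcast)
lats = pd.Series([0.0, 0.0, 90.0])
lons = pd.Series([0.0, 90.0, 0.0])
dists = haversine_km(lats, lons, 0.0, 0.0)
print(dists.round(1).tolist())  # [0.0, 10007.5, 10007.5]
```

A check like this is a cheap way to catch swapped latitude/longitude arguments or a missing degrees-to-radians conversion.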
Compute the time differences, in minutes (relative to the first row)
df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = df['time'].dt.total_seconds() / 60
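One detail worth spelling out: on a timedelta column, `.dt.seconds` holds only the seconds component within a day, so the days have to be accounted for separately, whereas `.dt.total_seconds()` returns the whole span. A small illustration with made-up timestamps:

```python
import pandas as pd

# Three hypothetical creation times; the last is one day and 5 minutes after the first
created = pd.Series(pd.to_datetime([
    '2012-10-01 00:00:00',
    '2012-10-01 00:05:00',
    '2012-10-02 00:05:00',
]))
delta = created - created.iloc[0]

# .dt.seconds ignores whole days
print(delta.dt.seconds.tolist())                 # [0, 300, 300]

# .dt.total_seconds() covers the full difference
print((delta.dt.total_seconds() / 60).tolist())  # [0.0, 5.0, 1445.0]
```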
Apply the filters (method 1), then select any 2 rows that satisfy the conditions stated in the OP
home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]
# Select any 2 rows that satisfy required conditions
# (sample raises a ValueError if fewer than 2 rows pass the filters)
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min
0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0
4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0
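The question asks for rows that are close to each other, while the steps above only compare every row with the first one. One possible way to generalize is sketched below, assuming the same column names as the dummy data; the helper names haversine_miles and rows_near, the thresholds, and the four sample rows are all made up for illustration. The reference row is a parameter, and the time difference is taken as an absolute value so rows created before the reference also count.

```python
import numpy as np
import pandas as pd

KM_TO_MILES = 0.621371

def haversine_miles(lat1, lon1, lat2, lon2, earth_radius_km=6371):
    """Great-circle distance in miles; all inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius_km * 2 * np.arcsin(np.sqrt(a)) * KM_TO_MILES

def rows_near(df, ref_label, max_home_miles, max_work_miles, max_minutes):
    """Hypothetical helper: rows of df within all three thresholds of row ref_label."""
    home = haversine_miles(df['Home_Latitude'], df['Home_Longitude'],
                           df.at[ref_label, 'Home_Latitude'],
                           df.at[ref_label, 'Home_Longitude'])
    work = haversine_miles(df['Work_Latitude'], df['Work_Longitude'],
                           df.at[ref_label, 'Work_Latitude'],
                           df.at[ref_label, 'Work_Longitude'])
    delta = (df['Created_Time'] - df.at[ref_label, 'Created_Time']).abs()
    minutes = delta.dt.total_seconds() / 60
    mask = (home <= max_home_miles) & (work <= max_work_miles) & (minutes <= max_minutes)
    return df[mask]

# Made-up sample: 'b' is close to 'a'; 'c' lives far away; 'd' was created hours later
df = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd'],
    'Created_Time': pd.to_datetime(['2019-02-23 20:00', '2019-02-23 20:10',
                                    '2019-02-23 20:12', '2019-02-23 23:00']),
    'Home_Latitude': [38.8951, 38.9000, 40.7100, 38.8951],
    'Home_Longitude': [-77.0364, -77.0400, -74.0100, -77.0364],
    'Work_Latitude': [38.8951, 38.9100, 38.8951, 38.8951],
    'Work_Longitude': [-77.0364, -77.0300, -77.0364, -77.0364],
})

near_a = rows_near(df, 0, max_home_miles=5, max_work_miles=5, max_minutes=30)
print(near_a['ID'].tolist())  # ['a', 'b']

# Repeating this for every row gives each row's neighbours (O(n^2) comparisons)
groups = {df.at[i, 'ID']: rows_near(df, i, 5, 5, 30)['ID'].tolist() for i in df.index}
print(groups)  # {'a': ['a', 'b'], 'b': ['a', 'b'], 'c': ['c'], 'd': ['d']}
```

Looping over every row is quadratic in the number of rows, which is fine for a few hundred rows like the dummy data but may need a spatial index or pre-bucketing by time for much larger frames.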
Apply the filters (method 2), then select any 2 rows that satisfy the conditions stated in the OP
multi_query = """Home_dist_miles<=12000 & \
Work_dist_miles<=8000 & \
time_min<=25"""
df_filtered = df.query(multi_query)
# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)
ID Created_Time Home_Longitude Home_Latitude Work_Longitude Work_Latitude Home_dist_miles Work_dist_miles time time_min
0 0 2012-10-01 00:00:00 -168.956448 -42.970705 -6.340945 -12.749469 0.000000 0.000000 00:00:00 0.0
4 4 2012-10-01 00:20:00 -73.120352 13.748187 -36.953587 23.528789 6259.078588 5939.425019 00:20:00 20.0
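If the thresholds need to vary, the query string does not have to hard-code them: DataFrame.query can reference local Python variables with the @ prefix. A minimal sketch, using a small made-up frame that already has the three derived columns (the variable names and values are illustrative):

```python
import pandas as pd

# Made-up derived columns, standing in for the ones computed earlier
df = pd.DataFrame({
    'Home_dist_miles': [0.0, 3.2, 250.0, 1.1],
    'Work_dist_miles': [0.0, 4.8, 2.0, 9.5],
    'time_min': [0.0, 10.0, 12.0, 20.0],
})

max_home_miles = 5
max_work_miles = 5
max_minutes = 25

# @name looks up a Python variable from the calling scope
df_filtered = df.query(
    "Home_dist_miles <= @max_home_miles"
    " & Work_dist_miles <= @max_work_miles"
    " & time_min <= @max_minutes"
)
print(df_filtered.index.tolist())  # [0, 1]
```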