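I have AIS data for several trips from Rotterdam to Hamburg. The route is divided into 6 sectors with predefined sector borders, and I need to know when and where a vessel enters the next sector. I tried using the last record in each sector, but the resolution of the data is not high enough. So I want to interpolate time and latitude at the longitude of the sector border.
The image below shows the borders I chose for this route. The longitude of a crossing is always exactly on the border line. What I need to determine is the latitude at which the vessel crosses that line.
My DataFrame looks like this: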
TripID time Latitude Longitude SectorID
0 42 7 52.9 4.4 1
1 42 8 53.0 4.6 1
2 42 9 53.0 4.7 1
3 42 10 53.1 4.9 2
4 5 9 53.0 4.5 1
5 5 10 53.0 4.7 1
6 5 11 53.2 5.0 2
7 5 12 53.3 5.2 2
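Here the border between sectors 1 and 2 is predefined at longitude 4.8, so for every trip and sector border I want to interpolate the latitude and time at longitude 4.8. I guess a good solution would involve df.groupby(['TripID', 'SectorID']).
I tried adding one entry per trip and sector that contains only the longitude of the sector border and then calling interpolate, but adding the entries took about an hour and interpolating the missing values crashed the program.
The result I am looking for should look like this: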
TripID time Latitude Longitude SectorID
0 42 7 52.9 4.4 1
1 42 8 53.0 4.6 1
2 42 9 53.0 4.7 1
8 42 9.5 53.05 4.8 1
3 42 10 53.1 4.9 2
4 5 9 53.0 4.5 1
5 5 10 53.0 4.7 1
9 5 10.3 53.06 4.8 1
6 5 11 53.2 5.0 2
7 5 12 53.3 5.2 2
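For reference, the two new rows in the table above are what linear interpolation between the neighbouring records yields. The following is a minimal sketch using np.interp, assuming the vessel moves in a straight line at constant speed between two records; the dictionary neighbours is only an illustration and is not part of the original data set.

```python
import numpy as np

BORDER_LON = 4.8  # border between sectors 1 and 2, as stated in the question

# For each trip: the last record before and the first record after the border
# (values copied from the DataFrame above)
neighbours = {
    42: {"lon": [4.7, 4.9], "lat": [53.0, 53.1], "time": [9, 10]},
    5:  {"lon": [4.7, 5.0], "lat": [53.0, 53.2], "time": [10, 11]},
}

crossings = {}
for trip_id, rec in neighbours.items():
    # np.interp(x, xp, fp) evaluates the straight line through (xp, fp) at x
    cross_lat = float(np.interp(BORDER_LON, rec["lon"], rec["lat"]))
    cross_time = float(np.interp(BORDER_LON, rec["lon"], rec["time"]))
    crossings[trip_id] = (cross_time, cross_lat)
    print(trip_id, round(cross_time, 2), round(cross_lat, 2))
```

This gives roughly time 9.5 and latitude 53.05 for trip 42, and roughly 10.33 and 53.07 for trip 5, close to the rows sketched above.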
I would also be happy with a result that looks like this:
TripID SectorID leave_lat leave_lon leave_time
42 1 53.05 4.8 9.5
5 1 53.06 4.8 10.3
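Please ask if my description of the problem is unclear.
Answer 0: (score: 1)
Since none of the usual pandas regulars has picked up this nice question, here is a solution that comes with some caveats. This is the sample input I used: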
TripID time Latitude Longitude
42 7 52.9 4.4
42 8 53.0 4.6
42 9 53.0 4.7 * missing value
42 10 53.1 4.9
42 11 53.2 4.9
42 12 53.3 5.3 * missing value
42 15 53.7 5.6
5 9 53.0 4.5
5 10 53.0 4.7 * missing value
5 11 53.2 5.0
5 12 53.4 5.2
5 14 53.6 5.3 * missing value
5 17 53.4 5.5
5 18 53.3 5.7
34 19 53.0 4.5
34 20 53.0 4.7
34 24 53.9 4.8 ** value already exists
34 25 53.8 4.9
34 27 53.8 5.3
34 28 53.8 5.3 * missing value
34 31 53.7 5.6
34 32 53.6 5.7
This code:
import numpy as np
import pandas as pd
#import data
df = pd.read_csv("test.txt", sep=r"\s+")
#set floating point output precision to prevent excessively long columns
pd.set_option("display.precision", 2)
#remember original column order
cols = df.columns
#define the sector borders
sectors = [4.8, 5.4]
#create all combinations of sector borders and TripIDs
dfborders = pd.DataFrame(index = pd.MultiIndex.from_product([df.TripID.unique(), sectors], names = ["TripID", "Longitude"])).reset_index()
#delete those combinations of TripID and Longitude that already exist in the original dataframe
dfborders = pd.merge(df, dfborders, on = ["TripID", "Longitude"], how = "right")
dfborders = dfborders[dfborders.isnull().any(axis = 1)]
#insert missing data points
df = pd.concat([df, dfborders])
#and sort dataframe to insert the missing data points in the right position
df = df[cols].groupby("TripID", sort = False).apply(pd.DataFrame.sort_values, ["Longitude", "time", "Latitude"])
#temporarily set longitude as index for value-based interpolation
df.set_index(["Longitude"], inplace = True, drop = False)
#interpolate group-wise
df = df.groupby("TripID", sort = False).apply(lambda g: g.interpolate(method = "index"))
#create sector ID column assuming that longitude is between -180 and +180
df["SectorID"] = np.digitize(df["Longitude"], bins = [-180] + sectors + [180])
#and reset index
df.reset_index(drop = True, inplace = True)
print(df)
produces the following output:
TripID time Latitude Longitude SectorID
0 42 7.00 52.90 4.4 1
1 42 8.00 53.00 4.6 1
2 42 9.00 53.00 4.7 1
3 42 9.50 53.05 4.8 2 * interpolated data point
4 42 10.00 53.10 4.9 2
5 42 11.00 53.20 4.9 2
6 42 12.00 53.30 5.3 2
7 42 13.00 53.43 5.4 3 * interpolated data point
8 42 15.00 53.70 5.6 3
9 5 9.00 53.00 4.5 1
10 5 10.00 53.00 4.7 1
11 5 10.33 53.07 4.8 2 * interpolated data point
12 5 11.00 53.20 5.0 2
13 5 12.00 53.40 5.2 2
14 5 14.00 53.60 5.3 2
15 5 15.50 53.50 5.4 3 * interpolated data point
16 5 17.00 53.40 5.5 3
17 5 18.00 53.30 5.7 3
18 34 19.00 53.00 4.5 1
19 34 20.00 53.00 4.7 1
20 34 24.00 53.90 4.8 2
21 34 25.00 53.80 4.9 2
22 34 27.00 53.80 5.3 2
23 34 28.00 53.80 5.3 2
24 34 29.00 53.77 5.4 3 * interpolated data point
25 34 31.00 53.70 5.6 3
26 34 32.00 53.60 5.7 3
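The reason Longitude is temporarily set as the index is that interpolate(method = "index") weights by the distance between index values, whereas the default linear method treats rows as equally spaced. A small sketch of the difference, using the two records around the 4.8 border of TripID 5 (the Series s is only an illustration):

```python
import numpy as np
import pandas as pd

# Latitude indexed by longitude, with a gap at the border longitude 4.8
s = pd.Series([53.0, np.nan, 53.2], index=[4.7, 4.8, 5.0])

# "linear" ignores the index and treats the rows as equally spaced
by_position = s.interpolate(method="linear")
# "index" uses the numerical index values as the x axis
by_index = s.interpolate(method="index")

print(by_position.loc[4.8])  # midpoint of 53.0 and 53.2
print(by_index.loc[4.8])     # one third of the way from 53.0 to 53.2
```

Only the second variant reproduces the latitude of about 53.07 shown in the output above.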
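Please note: I don't know how to add the missing rows in place. I will ask a separate question about how to do that, and if I get an answer I will update this post. Until then, a side effect is that the table is sorted by Longitude within each TripID, which assumes that Longitude never decreases; in reality that may not hold.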
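One possible way around the sorting caveat is to work on consecutive records in time order instead of longitude-sorted rows. The following is a minimal sketch and not the method used in this answer: the helper name border_crossings is made up for illustration, and it assumes straight-line motion at constant speed between two consecutive records.

```python
import pandas as pd

def border_crossings(df, borders):
    """Return one interpolated row for every time a trip crosses a border."""
    df = df.sort_values(["TripID", "time"])
    # next record of the same trip, aligned on the current record's index
    nxt = df.groupby("TripID")[["time", "Latitude", "Longitude"]].shift(-1)
    rows = []
    for b in borders:
        lo, hi = df["Longitude"], nxt["Longitude"]
        # border lies strictly between this record and the next one,
        # regardless of the direction of travel
        crossed = ((lo < b) & (hi > b)) | ((lo > b) & (hi < b))
        cur, fol = df[crossed], nxt[crossed]
        frac = (b - cur["Longitude"]) / (fol["Longitude"] - cur["Longitude"])
        rows.append(pd.DataFrame({
            "TripID": cur["TripID"],
            "time": cur["time"] + frac * (fol["time"] - cur["time"]),
            "Latitude": cur["Latitude"] + frac * (fol["Latitude"] - cur["Latitude"]),
            "Longitude": b,
        }))
    return pd.concat(rows, ignore_index=True)

# Sample data from the question
df = pd.DataFrame({
    "TripID":    [42, 42, 42, 42, 5, 5, 5, 5],
    "time":      [7, 8, 9, 10, 9, 10, 11, 12],
    "Latitude":  [52.9, 53.0, 53.0, 53.1, 53.0, 53.0, 53.2, 53.3],
    "Longitude": [4.4, 4.6, 4.7, 4.9, 4.5, 4.7, 5.0, 5.2],
})

crossings = border_crossings(df, borders=[4.8])
# put the new rows at their chronological position within each trip
combined = pd.concat([df, crossings]).sort_values(["TripID", "time"], ignore_index=True)
print(combined)
```

Because the rows are ordered by time rather than by Longitude, a vessel that briefly sails back west would produce several crossing rows instead of being silently reordered. Records that already lie exactly on a border are left as they are.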
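Answer 1: (score: 0)
I solved the problem in a different way. Since this solves the problem for me but is not exactly the solution I asked for, I will accept Mr. T's answer. For completeness I am posting it anyway, so here is my solution:
Starting from the DataFrame df in my question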
TripID time Latitude Longitude SectorID
0 42 7 52.9 4.4 1
1 42 8 53.0 4.6 1
2 42 9 53.0 4.7 1
3 42 10 53.1 4.9 2
4 5 9 53.0 4.5 1
5 5 10 53.0 4.7 1
6 5 11 53.2 5.0 2
7 5 12 53.3 5.2 2
I used this code
# look up the next record of the same trip
df = df.sort_values('time')
df['next_lat'] = df.groupby('TripID')['Latitude'].shift(-1)
df['next_lon'] = df.groupby('TripID')['Longitude'].shift(-1)
df['next_time'] = df.groupby('TripID')['time'].shift(-1)
df['next_sector_id'] = df.groupby('TripID')['SectorID'].shift(-1)
df = df.sort_values(['TripID', 'time'])
df['next_trip_id'] = df['TripID'].shift(-1)
# last record of each trip in each sector (copy to avoid SettingWithCopyWarning)
lasts = df[df['SectorID'] != df['next_sector_id']].copy()
# longitude of the border at which sector 1 is left
lasts.loc[lasts['SectorID'] == 1, 'sector_leave_lon'] = 4.8
# fraction of the way from the last record to the next one at which the border is crossed
frac = (lasts['sector_leave_lon'] - lasts['Longitude']) / (lasts['next_lon'] - lasts['Longitude'])
lasts['sector_leave_lat'] = lasts['Latitude'] + frac * (lasts['next_lat'] - lasts['Latitude'])
lasts['sector_leave_time'] = lasts['time'] + frac * (lasts['next_time'] - lasts['time'])
# copy the values back and spread them over all records of the same trip and sector
df['sector_leave_lat'] = lasts['sector_leave_lat']
df['sector_leave_time'] = lasts['sector_leave_time']
df['sector_leave_lat'] = df.groupby(['TripID', 'SectorID'])['sector_leave_lat'].transform('last')
df['sector_leave_time'] = df.groupby(['TripID', 'SectorID'])['sector_leave_time'].transform('last')
df = df.drop(['next_lat', 'next_lon', 'next_time', 'next_sector_id', 'next_trip_id'], axis = 1)
which gives a result like this
TripID time Latitude Longitude SectorID sector_leave_lat sector_leave_time
0 42 7 52.9 4.4 1 53.05 9.5
1 42 8 53.0 4.6 1 53.05 9.5
2 42 9 53.0 4.7 1 53.05 9.5
3 42 10 53.1 4.9 2 NaN NaN
4 5 9 53.0 4.5 1 53.06 10.3
5 5 10 53.0 4.7 1 53.06 10.3
6 5 11 53.2 5.0 2 NaN NaN
7 5 12 53.3 5.2 2 NaN NaN
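If the compact form from the question (one row per trip and sector) is preferred, the per-record columns can be collapsed afterwards. This is a rough sketch under a few assumptions: the result table is typed in by hand from the values shown above, and the dictionary leave_lon, which maps a sector to the longitude of the border where it is left, is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Result table from above, typed in by hand
df = pd.DataFrame({
    "TripID":    [42, 42, 42, 42, 5, 5, 5, 5],
    "time":      [7, 8, 9, 10, 9, 10, 11, 12],
    "Latitude":  [52.9, 53.0, 53.0, 53.1, 53.0, 53.0, 53.2, 53.3],
    "Longitude": [4.4, 4.6, 4.7, 4.9, 4.5, 4.7, 5.0, 5.2],
    "SectorID":  [1, 1, 1, 2, 1, 1, 2, 2],
    "sector_leave_lat":  [53.05, 53.05, 53.05, np.nan, 53.06, 53.06, np.nan, np.nan],
    "sector_leave_time": [9.5, 9.5, 9.5, np.nan, 10.3, 10.3, np.nan, np.nan],
})

# Hypothetical lookup: longitude of the border at which each sector is left
leave_lon = {1: 4.8}

summary = (
    df.dropna(subset=["sector_leave_lat"])            # keep sectors that were left
      .groupby(["TripID", "SectorID"], sort=False, as_index=False)
      .agg(leave_lat=("sector_leave_lat", "first"),
           leave_time=("sector_leave_time", "first"))
)
# place leave_lon between leave_lat and leave_time, as in the question
summary.insert(3, "leave_lon", summary["SectorID"].map(leave_lon))
print(summary)
```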
I hope this helps anyone for whom the accepted solution does not work.