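I have two datasets: one with waypoints and one with tracks (trips).
I want to add the "Mode" variable from the tracks dataset to the waypoints dataset, based on the time at which each waypoint was tracked.
In the waypoints dataset, I have the variable "tracked_at":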
library(lubridate)  # provides ymd_hms()
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, 16, 17, 18, 19, 20)
tracked_at <- ymd_hms(c("2017-10-16 06:00:02", "2017-10-16 06:00:07", "2017-10-16 06:01:15", "2017-10-16 06:02:09",
"2017-10-16 06:02:50", "2017-10-16 06:04:05", "2017-10-16 06:04:15", "2017-10-16 06:10:15",
"2017-10-16 06:14:15", "2017-10-16 06:16:15", "2017-10-16 06:18:30", "2017-10-16 06:18:45",
"2017-10-16 06:19:15", "2017-10-16 06:19:40", "2017-10-16 06:19:55", "2017-10-17 08:08:02",
"2017-10-17 08:10:02", "2017-10-17 08:16:02", "2017-10-17 08:17:02", "2017-10-18 15:00:00"))
lat <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
long <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
id_user<- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3)
df1 <- data.frame(ID, tracked_at,lat, long, id_user )
print(df1)
ID tracked_at lat long id_user
1 1 2017-10-16 06:00:02 1 2 1
2 2 2017-10-16 06:00:07 1 2 1
3 3 2017-10-16 06:01:15 1 2 1
4 4 2017-10-16 06:02:09 1 2 1
5 5 2017-10-16 06:02:50 1 2 1
6 6 2017-10-16 06:04:05 1 2 1
7 7 2017-10-16 06:04:15 1 2 1
8 8 2017-10-16 06:10:15 1 2 1
9 9 2017-10-16 06:14:15 1 2 1
10 10 2017-10-16 06:16:15 1 2 1
11 11 2017-10-16 06:18:30 1 2 1
12 12 2017-10-16 06:18:45 1 2 1
13 13 2017-10-16 06:19:15 1 2 1
14 14 2017-10-16 06:19:40 1 2 1
15 15 2017-10-16 06:19:55 1 2 1
16 16 2017-10-17 08:08:02 1 2 2
17 17 2017-10-17 08:10:02 1 2 2
18 18 2017-10-17 08:16:02 1 2 2
19 19 2017-10-17 08:17:02 1 2 2
20 20 2017-10-18 15:00:00 1 2 3
In the tracks dataset, I have the variables "started_at" and "finished_a":
started_at <- ymd_hms(c("2017-10-16 06:00:05", "2017-10-16 06:04:15", "2017-10-16 06:18:31", "2017-10-17 08:10:02"))
finished_a <- ymd_hms(c("2017-10-16 06:02:10", "2017-10-16 06:18:30", "2017-10-16 06:19:45", "2017-10-17 08:16:02"))
id_user <- c(1, 1, 1, 2)
Mode <- c("Walk", "Train", "Walk", "Car")
df2 <- data.frame(started_at,finished_a, id_user, Mode )
print(df2)
started_at finished_a id_user Mode
1 2017-10-16 06:00:05 2017-10-16 06:02:10 1 Walk
2 2017-10-16 06:04:15 2017-10-16 06:18:30 1 Train
3 2017-10-16 06:18:31 2017-10-16 06:19:45 1 Walk
4 2017-10-17 08:10:02 2017-10-17 08:16:02 2 Car
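These 3 variables are date-times (ymd_hms). Waypoints are also recorded when the person is not moving, so the mode column should be NA most of the time. I would also like to add one rule: if the NAs come before a "Bus" or "Train" mode and after a "Walk" mode, then they are "Waiting Time".
This would be the ideal dataset: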
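The waiting-time rule can be stated on a plain vector of modes, independently of how the modes get attached to the waypoints. Below is a minimal base R sketch, assuming the vector holds one user's modes in time order; `label_waiting` and the `"<NA>"` sentinel are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: relabel runs of NA that sit between a "Walk" run and a
# "Bus"/"Train" run as "Waiting Time". `mode` is one user's modes in time order.
label_waiting <- function(mode) {
  # rle() treats every NA as its own run, so replace NA with a sentinel first
  key <- ifelse(is.na(mode), "<NA>", mode)
  runs <- rle(key)
  n <- length(runs$values)
  prev_val <- c(NA, runs$values[-n])   # value of the run just before each run
  next_val <- c(runs$values[-1], NA)   # value of the run just after each run
  waiting <- runs$values == "<NA>" &
    prev_val %in% "Walk" &
    next_val %in% c("Bus", "Train")
  runs$values[waiting] <- "Waiting Time"
  out <- inverse.rle(runs)
  out[out == "<NA>"] <- NA             # restore the remaining gaps to real NA
  out
}

modes <- c(NA, "Walk", "Walk", NA, NA, "Train", "Walk", NA, "Car")
label_waiting(modes)
```

With several users in one data frame, the helper would have to be applied per `id_user` after sorting by `tracked_at`, so that a gap at the end of one user's data is never compared with the first trip of the next user.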
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15, 16, 17, 18, 19, 20)
tracked_at <- ymd_hms(c("2017-10-16 06:00:02", "2017-10-16 06:00:07", "2017-10-16 06:01:15", "2017-10-16 06:02:09",
"2017-10-16 06:02:50", "2017-10-16 06:04:05", "2017-10-16 06:04:15", "2017-10-16 06:10:15",
"2017-10-16 06:14:15", "2017-10-16 06:16:15", "2017-10-16 06:18:30", "2017-10-16 06:18:45",
"2017-10-16 06:19:15", "2017-10-16 06:19:40", "2017-10-16 06:19:55", "2017-10-17 08:08:02",
"2017-10-17 08:10:02", "2017-10-17 08:16:02", "2017-10-17 08:17:02", "2017-10-18 15:00:00"))
lat <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1)
long <- c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
id_user<- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,3)
Newmode <- c("NA","Walk", "Walk", "Walk", "Waiting Time", "Waiting Time", "Train", "Train", "Train", "Train", "Train",
"Walk","Walk", "Walk", "NA", "NA", "Car", "Car", "NA", "NA")
df3 <- data.frame(ID, tracked_at,lat, long, id_user, Newmode )
print(df3)
ID tracked_at lat long id_user Newmode
1 1 2017-10-16 06:00:02 1 2 1 NA
2 2 2017-10-16 06:00:07 1 2 1 Walk
3 3 2017-10-16 06:01:15 1 2 1 Walk
4 4 2017-10-16 06:02:09 1 2 1 Walk
5 5 2017-10-16 06:02:50 1 2 1 Waiting Time
6 6 2017-10-16 06:04:05 1 2 1 Waiting Time
7 7 2017-10-16 06:04:15 1 2 1 Train
8 8 2017-10-16 06:10:15 1 2 1 Train
9 9 2017-10-16 06:14:15 1 2 1 Train
10 10 2017-10-16 06:16:15 1 2 1 Train
11 11 2017-10-16 06:18:30 1 2 1 Train
12 12 2017-10-16 06:18:45 1 2 1 Walk
13 13 2017-10-16 06:19:15 1 2 1 Walk
14 14 2017-10-16 06:19:40 1 2 1 Walk
15 15 2017-10-16 06:19:55 1 2 1 NA
16 16 2017-10-17 08:08:02 1 2 2 NA
17 17 2017-10-17 08:10:02 1 2 2 Car
18 18 2017-10-17 08:16:02 1 2 2 Car
19 19 2017-10-17 08:17:02 1 2 2 NA
20 20 2017-10-18 15:00:00 1 2 3 NA
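My best guess is:
library(dplyr)  # provides filter()
id1 <- filter(df1, id_user == 1)      # waypoints of user 1
id1Moda <- filter(df2, id_user == 1)  # trips of user 1
# numeric code for each mode
id1Moda$mode.num[id1Moda$Mode == "Walk"] <- 1
id1Moda$mode.num[id1Moda$Mode == "Train"] <- 2
id1Moda$mode.num[id1Moda$Mode == "Car"] <- 3
id1$Newmode <- NA  # initialise the column that the loop fills
for (i in seq_len(nrow(id1Moda))) {
  for (k in seq_len(nrow(id1))) {
    # waypoint k falls inside trip i (both bounds inclusive)
    if (id1$tracked_at[k] >= id1Moda$started_at[i] && id1$tracked_at[k] <= id1Moda$finished_a[i]) {
      id1$Newmode[k] <- id1Moda$mode.num[i]
    }
  }
}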
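Ideally I would not filter by user, because I have 50 users to analyse. The loop is also very slow, because the dataset for this single ID has more than 280,000 points. The whole database has 38,000,000 data points in total.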
Thank you for your help!
Answer 0 (score: 1)
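Try this.
A join followed by a filter does the job. Be aware that the inner join can use a lot of memory on large datasets, because it first pairs every waypoint of a user with every trip of that user.
Note that I have renamed finished_a to finished_at.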
library(dplyr)
df2 <- rename(df2, finished_at = finished_a)  # the rename mentioned above
df1 %>% inner_join(df2, by = "id_user") %>%
  filter(tracked_at >= started_at, tracked_at <= finished_at)
ID tracked_at lat long id_user started_at finished_at Mode
1 2 2017-10-16 06:00:07 1 2 1 2017-10-16 06:00:05 2017-10-16 06:02:10 Walk
2 3 2017-10-16 06:01:15 1 2 1 2017-10-16 06:00:05 2017-10-16 06:02:10 Walk
3 4 2017-10-16 06:02:09 1 2 1 2017-10-16 06:00:05 2017-10-16 06:02:10 Walk
4 7 2017-10-16 06:04:15 1 2 1 2017-10-16 06:04:15 2017-10-16 06:18:30 Train
5 8 2017-10-16 06:10:15 1 2 1 2017-10-16 06:04:15 2017-10-16 06:18:30 Train
6 9 2017-10-16 06:14:15 1 2 1 2017-10-16 06:04:15 2017-10-16 06:18:30 Train
7 10 2017-10-16 06:16:15 1 2 1 2017-10-16 06:04:15 2017-10-16 06:18:30 Train
8 11 2017-10-16 06:18:30 1 2 1 2017-10-16 06:04:15 2017-10-16 06:18:30 Train
9 12 2017-10-16 06:18:45 1 2 1 2017-10-16 06:18:31 2017-10-16 06:19:45 Walk
10 13 2017-10-16 06:19:15 1 2 1 2017-10-16 06:18:31 2017-10-16 06:19:45 Walk
11 14 2017-10-16 06:19:40 1 2 1 2017-10-16 06:18:31 2017-10-16 06:19:45 Walk
12 17 2017-10-17 08:10:02 1 2 2 2017-10-17 08:10:02 2017-10-17 08:16:02 Car
13 18 2017-10-17 08:16:02 1 2 2 2017-10-17 08:10:02 2017-10-17 08:16:02 Car
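The result above keeps only the waypoints that fall inside a trip, whereas the question wants every waypoint kept, with NA where no trip matches. One possible alternative, not part of the answer above, is a non-equi update join from the data.table package, which adds the mode column in place without building the full waypoint-by-trip product. This is a sketch on a reduced copy of the sample data; `waypoints`, `trips` and `ts` are names made up for the illustration, and the original column name `finished_a` is assumed.

```r
library(data.table)

# Small local helper for this sketch: parse timestamps as UTC date-times
ts <- function(x) as.POSIXct(x, tz = "UTC")

# Reduced copy of the waypoints (df1 in the question)
waypoints <- data.table(
  ID = 1:6,
  tracked_at = ts(c("2017-10-16 06:00:02", "2017-10-16 06:00:07",
                    "2017-10-16 06:02:50", "2017-10-16 06:04:15",
                    "2017-10-17 08:10:02", "2017-10-18 15:00:00")),
  id_user = c(1, 1, 1, 1, 2, 3)
)

# Reduced copy of the trips (df2 in the question)
trips <- data.table(
  started_at = ts(c("2017-10-16 06:00:05", "2017-10-16 06:04:15",
                    "2017-10-17 08:10:02")),
  finished_a = ts(c("2017-10-16 06:02:10", "2017-10-16 06:18:30",
                    "2017-10-17 08:16:02")),
  id_user = c(1, 1, 2),
  Mode = c("Walk", "Train", "Car")
)

# Non-equi update join: for each trip, find the same user's waypoints whose
# tracked_at lies in [started_at, finished_a] and write the trip's Mode there.
# Waypoints that match no trip keep NA in Newmode.
waypoints[trips,
          on = .(id_user, tracked_at >= started_at, tracked_at <= finished_a),
          Newmode := i.Mode]

print(waypoints)
```

If trips of the same user overlap, a waypoint can match more than one trip and the last matching trip determines its mode, so this sketch assumes non-overlapping trips per user.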
A similar problem is discussed in