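Following up on a question I asked here yesterday, I am trying to design a loop that subsets events in one data set, df1, based on unique combinations of matching date, time, and ID in a second data set, df2. The output of each iteration will be multiple rows, and each iteration will have a different number of rows or may be empty. In the end I need to combine all of the iteration outputs into 1 data frame showing the date, time, and ID number of every event on every date. Allocating an empty matrix and running a regular for loop or a nested loop isn't getting me anywhere. I don't know whether I need to start with a different type of structure or whether my dimensions are wrong. Maybe there is a simpler way.
Below is a sample of the data structure (the original data is much longer).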
dput(df1)
structure(list(Date = c("12-31-2008", "12-31-2008", "12-31-2008",
"12-31-2008", "12-31-2008", "12-31-2008", "01-01-2009", "01-01-2009",
"01-01-2009", "01-01-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009",
"01-10-2009", "01-11-2009", "01-11-2009", "01-17-2009", "01-17-2009",
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009",
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009",
"01-18-2009", "01-18-2009", "01-19-2009", "01-19-2009", "01-19-2009",
"01-19-2009", "01-19-2009"), IDNum = c("534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198"), Time = c("19:01",
"19:53", "20:55", "22:03", "23:04", "23:55", "00:45", "01:48",
"02:50", "03:50", "02:35", "03:42", "04:49", "05:53", "06:55",
"07:55", "08:43", "10:23", "10:31", "11:41", "15:27", "16:33",
"17:41", "18:46", "19:46", "20:48", "21:48", "22:48", "23:48",
"01:49", "02:49", "21:49", "22:49", "12:04", "13:04", "15:05",
"16:05", "17:05", "18:07", "18:49", "19:49", "20:49", "21:49",
"22:50", "23:50", "00:50", "01:50", "03:02", "04:22", "05:25"
)), .Names = c("Date", "IDNum", "Time"), row.names = 8643:8692, class = "data.frame")
dput(df2)
structure(list(Date = c("01-04-2009", "01-05-2009", "01-05-2009",
"01-06-2009", "01-06-2009", "01-07-2009", "01-07-2009", "01-08-2009",
"01-08-2009", "01-09-2009", "01-09-2009", "01-10-2009", "01-11-2009",
"01-12-2009", "01-12-2009", "01-13-2009", "01-14-2009", "01-14-2009",
"01-21-2009", "01-21-2009", "01-22-2009", "01-22-2009", "01-23-2009",
"01-23-2009", "01-24-2009", "01-24-2009", "01-25-2009", "01-25-2009",
"01-26-2009", "01-26-2009", "01-27-2009", "01-28-2009", "01-28-2009",
"01-28-2009", "01-28-2009", "01-29-2009", "01-29-2009", "01-29-2009",
"01-29-2009", "02-05-2009", "02-05-2009", "02-05-2009", "02-06-2009",
"02-06-2009", "02-06-2009", "02-07-2009", "02-07-2009", "02-07-2009",
"02-08-2009", "02-08-2009"), IDNum = c("599091", "599091", "599091",
"599091", "599091", "599091", "599091", "599091", "599091", "599091",
"599091", "599091", "599091", "599091", "599091", "599091", "599091",
"599091", "534198", "534198", "534198", "534198", "534198", "534198",
"534198", "534198", "534198", "534198", "534198", "534198", "534198",
"697345", "697345", "534198", "534198", "697345", "697345", "697345",
"534198", "697345", "697345", "697345", "697345", "697345", "697345",
"697345", "697345", "697345", "697345", "697345"), Trip = c("GL0229",
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229",
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229",
"GL0229", "GL0229", "GL0229", "GL0230", "GL0230", "GL0230", "GL0230",
"GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230",
"GL0230", "GL0230", "GL0233", "GL0233", "GL0230", "GL0230", "GL0233",
"GL0233", "GL0233", "GL0230", "GL0234", "GL0234", "GL0234", "GL0234",
"GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234"
), Replicate = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L,
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 14L, 15L, 3L, 4L, 5L, 16L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), Start = c("12:00",
"08:35", "15:33", "08:30", "15:51", "10:02", "23:04", "11:17",
"21:31", "11:16", "20:07", "11:28", "07:37", "08:40", "16:32",
"09:14", "08:04", "15:15", "07:16", "16:17", "07:10", "16:40",
"07:00", "16:25", "07:17", "16:50", "07:20", "16:18", "07:20",
"15:40", "07:10", "09:34", "11:07", "07:55", "16:38", "07:01",
"08:26", "14:47", "07:18", "07:47", "09:17", "14:58", "07:48",
"08:59", "14:53", "07:30", "09:12", "13:47", "08:56", "09:53"
), End = c("17:21", "15:08", "22:44", "15:12", "09:06", "19:16",
"10:28", "20:12", "10:14", "18:48", "10:53", "20:23", "14:07",
"15:02", "22:27", "18:03", "15:07", "21:19", "16:04", "22:04",
"16:31", "23:01", "16:15", "22:07", "16:33", "22:37", "16:05",
"22:17", "15:22", "22:31", "16:05", "16:41", "19:01", "16:20",
"21:56", "14:31", "19:46", "00:30", "15:10", "14:21", "19:27",
"23:45", "14:31", "19:20", "23:05", "14:51", "20:15", "00:17",
"14:31", "18:07")), .Names = c("Date", "IDNum", "Trip", "Replicate",
"Start", "End"), row.names = 506:555, class = "data.frame")
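First, I find the dates that match between the 2 data sets and create a new variable, records, that shows the information from df2 for a matching date. In this example I just use the second matching date: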
match_dates <- as.character(intersect(df1$Date, df2$Date))
records <- df2[which(df2$Date == match_dates[2]),]
print(records)
Date IDNum Trip Replicate Start End
518 01-11-2009 599091 GL0229 13 07:37 14:07
In the original, much larger data set, records ends up looking more like this:
records <- df2[which(df2$Date == match_dates[25]),]
print(records)
# Date IDNum Trip Replicate Start End
# 659 04-02-2009 507646 GL0247 10 09:43 05:19
# 660 04-02-2009 680845 GL0249 4 05:37 11:29
# 661 04-02-2009 680845 GL0249 5 11:59 16:47
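Then, for each row of records, the events of interest are defined as the df1 times that fall between Start and End, like this (I did it this way to preserve the unique date-time-ID-replicate combinations):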
event1 <- subset(df1, Date==records[1,"Date"] & IDNum==records[1,"IDNum"] & Time >= records[1,"Start"] & Time <= records[1,"End"])
event2 <- subset(df1, Date==records[2,"Date"] & IDNum==records[2,"IDNum"] & Time >= records[2,"Start"] & Time <= records[2,"End"])
event3 <- subset(df1, Date==records[3,"Date"] & IDNum==records[3,"IDNum"] & Time >= records[3,"Start"] & Time <= records[3,"End"])
The result for each event looks like this:
print(event1) #This result is empty
[1] NewRecNum Date IDNum Time Speed
<0 rows> (or 0-length row.names)
print(event2)
Date IDNum Time
80620 04-02-2009 680845 06:35
80621 04-02-2009 680845 07:35
80622 04-02-2009 680845 08:35
80623 04-02-2009 680845 09:35
80624 04-02-2009 680845 10:35
print(event3)
Date IDNum Time
80626 04-02-2009 680845 12:35
80627 04-02-2009 680845 13:35
80628 04-02-2009 680845 14:35
80629 04-02-2009 680845 15:35
80630 04-02-2009 680845 16:35
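My goal is a loop that takes each instance of a matching date from match_dates (147 in this case), creates the 147 corresponding records from df2, then uses the Date, IDNum, Start, and End times in each records to subset df1 and output the matching df1 events. What I have so far (which doesn't work):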
records <- matrix(ncol=6, nrow=nrow(df1)) # Create an empty matrix to start
event=NULL
for (i in 1:length(match_dates))
{ records[i] <- df2[which(df2$Date == match_dates[i]), ]
for (j in 1:nrow(records[i]))
{ event[j] <- subset(df1, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
}
}
print(event)
Error in 1:nrow(records[i]) : argument of length 0
In addition: Warning message:
In records[i] <- df2[which(df2$Date == match_dates[i]), ] :
number of items to replace is not a multiple of replacement length
> print(event)
NULL
Thanks in advance for your help! I am banging my head against the wall.
Edit/Update:
I changed records to
records <- subset(df2, Date %in% df1$Date)
and then wrote a function to pull the matching rows out of df1:
event_func <- function(df,records,i){
event_int <- subset(df, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
return(event_int)
}
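This function works and outputs what I need. But I am still stuck on a loop that takes the 686 rows of records, matches them against df1, and outputs a final data frame of all the matching df1 rows. I have also tried lapply. Here is what I have (neither works):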
# First option using a loop
final <- data.frame()
event_int <- data.frame()
for (i in 1:nrow(records)) {
event_int[i] <- event_func(df1, records,i)
final <- rbind(event_int, event_int[i])
}
# Second option using lapply
lapply(records, event_func(df1,records,1:nrow(records)))
Thanks again for your help!
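Answer 0 (score: 1)
There are several problems here.
1) records[i] is not correct; if you are assigning to a row, you need records[i,].
2) df2[which(df2$Date == match_dates[i]),] is not guaranteed to have any particular size, and by assigning it to records[i,] inside a loop you are making assumptions about its size. You could assign an intermediate value and use another loop to put it into records, or better, use the rbind function on each iteration of the loop, which removes the need to specify the size of records in advance.
3) Assigning a data.frame (df2) to a matrix (records) without any conversion is asking for trouble. records should be a data.frame anyway.
A simpler approach is to use the match() function through its %in% interface to build records.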
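A minimal sketch of the points above, using small made-up data frames in place of the question's df1 and df2 (the toy values and the names rec, hit, and events are assumptions for illustration, not the asker's data). It keeps records as a data.frame built with %in%, selects rows with records[i, ], and grows the result with rbind so that no size has to be declared up front.

```r
# Toy stand-ins for the question's df1 and df2 (values are illustrative only)
df1 <- data.frame(
  Date  = c("01-10-2009", "01-10-2009", "01-11-2009", "01-11-2009"),
  IDNum = c("534198", "534198", "534198", "599091"),
  Time  = c("02:35", "12:00", "08:15", "09:00"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Date  = c("01-10-2009", "01-11-2009", "01-12-2009"),
  IDNum = c("534198", "599091", "599091"),
  Start = c("11:28", "07:37", "08:40"),
  End   = c("20:23", "14:07", "15:02"),
  stringsAsFactors = FALSE
)

# records stays a data.frame; %in% keeps every df2 row whose date occurs in df1
records <- df2[df2$Date %in% df1$Date, ]

# Start from a zero-row copy of df1 and grow it with rbind;
# an iteration with no matches simply adds zero rows
events <- df1[0, ]
for (i in seq_len(nrow(records))) {
  rec <- records[i, ]  # note the comma: row i, all columns
  hit <- df1[df1$Date == rec$Date & df1$IDNum == rec$IDNum &
             df1$Time >= rec$Start & df1$Time <= rec$End, ]
  events <- rbind(events, hit)
}
print(events)
```

Comparing Time, Start, and End as strings only works here because they are zero-padded "HH:MM" values, as in the question. seq_len(nrow(records)) is used instead of 1:nrow(records) so the loop is skipped cleanly when records has no rows.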
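Answer 1 (score: 0)
Finally got something working! I ended up changing some of my original code and found a very useful answer in another post here.
1) I first create records by matching df1 against df2 on IDNum and Date: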
records <- subset(df1, IDNum %in% df2$IDNum)
records <- subset(records, Date %in% df2$Date)
# Records looks like:
head(records,5)
Date IDNum Time Speed
8653 01-10-2009 534198 02:35 4.001809
8654 01-10-2009 534198 03:42 4.117383
8655 01-10-2009 534198 04:49 4.263277
8656 01-10-2009 534198 05:53 4.310865
8657 01-10-2009 534198 06:55 4.353049
# df2 looks like:
head(df2)
Date IDNum Trip Replicate Start End
506 01-04-2009 599091 GL0229 1 12:00 17:21
507 01-05-2009 599091 GL0229 2 08:35 15:08
508 01-05-2009 599091 GL0229 3 15:33 22:44
509 01-06-2009 599091 GL0229 4 08:30 15:12
510 01-06-2009 599091 GL0229 5 15:51 09:06
511 01-07-2009 599091 GL0229 6 10:02 19:16
2) My function subsets records based on the matching ID, date, and times in df2:
event_func <- function(i,...) {
event_int <- subset(records, Date==df2[i,"Date"] & IDNum==df2[i,"IDNum"] & Time >= df2[i,"Start"] & Time <= df2[i,"End"])
output <- event_int
return(output)
}
# For example, subsetting records based on the first row of df2
event_func(1)
Date IDNum Time Speed
38613 01-04-2009 599091 12:24 1.611527
38614 01-04-2009 599091 15:58 1.545299
38615 01-04-2009 599091 17:02 1.527205
3) I repeat event_func over all 686 rows of df2 and use the foreach package to put the results into a single data frame.
library(foreach)
final.match <- foreach(i = 1:nrow(df2), .combine=rbind) %do% {
event_func(i)}
The output of final.match is a data frame with 4 columns and 1634 rows, which is exactly what I was looking for!
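For comparison, the same row-by-row collection can probably be written in base R without the foreach package. This is only a sketch, assuming event_func returns a data frame (possibly with zero rows) for every row index; the toy records and df2 below are made up for illustration and are not the data from the post.

```r
# Toy stand-ins (illustrative values only, not the data from the post)
records <- data.frame(
  Date  = c("01-04-2009", "01-04-2009", "01-05-2009"),
  IDNum = c("599091", "599091", "599091"),
  Time  = c("12:24", "18:30", "09:10"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Date  = c("01-04-2009", "01-05-2009", "01-05-2009"),
  IDNum = c("599091", "599091", "599091"),
  Start = c("12:00", "08:35", "15:33"),
  End   = c("17:21", "15:08", "22:44"),
  stringsAsFactors = FALSE
)

# Same matching rule as event_func above, written with explicit indexing
event_func <- function(i) {
  records[records$Date == df2$Date[i] & records$IDNum == df2$IDNum[i] &
          records$Time >= df2$Start[i] & records$Time <= df2$End[i], ]
}

# lapply returns one data frame per df2 row; do.call(rbind, ...) stacks them
pieces <- lapply(seq_len(nrow(df2)), event_func)
final.match <- do.call(rbind, pieces)
print(final.match)
```

This also shows why the lapply attempt in the question fails: lapply expects a function as its second argument, whereas event_func(df1, records, 1:nrow(records)) is a call that is evaluated immediately, and passing the data frame records as the first argument iterates over its columns, not its rows.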