R loop to subset a large data frame and give multi-row output

Date: 2015-12-23 22:01:19

Tags: r loops subset

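Following on from a question I asked here yesterday, I am trying to design a loop that subsets the events in one dataset, df1, based on unique combinations of matching date, time and ID in a second dataset, df2. The output of each iteration will be multiple rows, and each iteration will have a different number of rows, or may be empty. In the end I need to combine all of the iteration outputs into one data frame that shows the date, time and ID number of every event on each date. Allocating an empty matrix and running a regular for loop or a nested loop is not getting me anywhere. I don't know whether I need to start from a different kind of structure, or whether my dimensions are wrong. Perhaps there is an easier way.

Below is an example of the data structure (the original data are much longer).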

dput(df1)
structure(list(Date = c("12-31-2008", "12-31-2008", "12-31-2008", 
"12-31-2008", "12-31-2008", "12-31-2008", "01-01-2009", "01-01-2009", 
"01-01-2009", "01-01-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", "01-10-2009", 
"01-10-2009", "01-11-2009", "01-11-2009", "01-17-2009", "01-17-2009", 
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", 
"01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", "01-18-2009", 
"01-18-2009", "01-18-2009", "01-19-2009", "01-19-2009", "01-19-2009", 
"01-19-2009", "01-19-2009"), IDNum = c("534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198"), Time = c("19:01", 
"19:53", "20:55", "22:03", "23:04", "23:55", "00:45", "01:48", 
"02:50", "03:50", "02:35", "03:42", "04:49", "05:53", "06:55", 
"07:55", "08:43", "10:23", "10:31", "11:41", "15:27", "16:33", 
"17:41", "18:46", "19:46", "20:48", "21:48", "22:48", "23:48", 
"01:49", "02:49", "21:49", "22:49", "12:04", "13:04", "15:05", 
"16:05", "17:05", "18:07", "18:49", "19:49", "20:49", "21:49", 
"22:50", "23:50", "00:50", "01:50", "03:02", "04:22", "05:25"
)), .Names = c("Date", "IDNum", "Time"), row.names = 8643:8692, class = "data.frame")

dput(df2)
structure(list(Date = c("01-04-2009", "01-05-2009", "01-05-2009", 
"01-06-2009", "01-06-2009", "01-07-2009", "01-07-2009", "01-08-2009", 
"01-08-2009", "01-09-2009", "01-09-2009", "01-10-2009", "01-11-2009", 
"01-12-2009", "01-12-2009", "01-13-2009", "01-14-2009", "01-14-2009", 
"01-21-2009", "01-21-2009", "01-22-2009", "01-22-2009", "01-23-2009", 
"01-23-2009", "01-24-2009", "01-24-2009", "01-25-2009", "01-25-2009", 
"01-26-2009", "01-26-2009", "01-27-2009", "01-28-2009", "01-28-2009", 
"01-28-2009", "01-28-2009", "01-29-2009", "01-29-2009", "01-29-2009", 
"01-29-2009", "02-05-2009", "02-05-2009", "02-05-2009", "02-06-2009", 
"02-06-2009", "02-06-2009", "02-07-2009", "02-07-2009", "02-07-2009", 
"02-08-2009", "02-08-2009"), IDNum = c("599091", "599091", "599091", 
"599091", "599091", "599091", "599091", "599091", "599091", "599091", 
"599091", "599091", "599091", "599091", "599091", "599091", "599091", 
"599091", "534198", "534198", "534198", "534198", "534198", "534198", 
"534198", "534198", "534198", "534198", "534198", "534198", "534198", 
"697345", "697345", "534198", "534198", "697345", "697345", "697345", 
"534198", "697345", "697345", "697345", "697345", "697345", "697345", 
"697345", "697345", "697345", "697345", "697345"), Trip = c("GL0229", 
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", 
"GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", "GL0229", 
"GL0229", "GL0229", "GL0229", "GL0230", "GL0230", "GL0230", "GL0230", 
"GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", "GL0230", 
"GL0230", "GL0230", "GL0233", "GL0233", "GL0230", "GL0230", "GL0233", 
"GL0233", "GL0233", "GL0230", "GL0234", "GL0234", "GL0234", "GL0234", 
"GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234", "GL0234"
), Replicate = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 16L, 17L, 18L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 1L, 2L, 14L, 15L, 3L, 4L, 5L, 16L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), Start = c("12:00", 
"08:35", "15:33", "08:30", "15:51", "10:02", "23:04", "11:17", 
"21:31", "11:16", "20:07", "11:28", "07:37", "08:40", "16:32", 
"09:14", "08:04", "15:15", "07:16", "16:17", "07:10", "16:40", 
"07:00", "16:25", "07:17", "16:50", "07:20", "16:18", "07:20", 
"15:40", "07:10", "09:34", "11:07", "07:55", "16:38", "07:01", 
"08:26", "14:47", "07:18", "07:47", "09:17", "14:58", "07:48", 
"08:59", "14:53", "07:30", "09:12", "13:47", "08:56", "09:53"
), End = c("17:21", "15:08", "22:44", "15:12", "09:06", "19:16", 
"10:28", "20:12", "10:14", "18:48", "10:53", "20:23", "14:07", 
"15:02", "22:27", "18:03", "15:07", "21:19", "16:04", "22:04", 
"16:31", "23:01", "16:15", "22:07", "16:33", "22:37", "16:05", 
"22:17", "15:22", "22:31", "16:05", "16:41", "19:01", "16:20", 
"21:56", "14:31", "19:46", "00:30", "15:10", "14:21", "19:27", 
"23:45", "14:31", "19:20", "23:05", "14:51", "20:15", "00:17", 
"14:31", "18:07")), .Names = c("Date", "IDNum", "Trip", "Replicate", 
"Start", "End"), row.names = 506:555, class = "data.frame")

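First, I found the dates that match between the 2 datasets and created a new variable, records, that shows the information from df2 for a matching date. In this example I am just using the second matching date: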

match_dates <- as.character(intersect(df1$Date, df2$Date))
records <- df2[which(df2$Date == match_dates[2]),]
print(records)

          Date  IDNum   Trip Replicate Start   End
518 01-11-2009 599091 GL0229        13 07:37 14:07

In the original, much larger dataset, records ends up looking more like this:

records <- df2[which(df2$Date == match_dates[25]),]
print(records)
#           Date  IDNum   Trip Replicate Start   End
# 659 04-02-2009 507646 GL0247        10 09:43 05:19
# 660 04-02-2009 680845 GL0249         4 05:37 11:29
# 661 04-02-2009 680845 GL0249         5 11:59 16:47

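Then, for each row of records, the events of interest are defined as the df1 rows whose Time falls between Start and End, something like this (I did it this way to preserve the unique date-time-ID-replicate combinations):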

event1 <- subset(df1, Date==records[1,"Date"] & IDNum==records[1,"IDNum"] & Time >= records[1,"Start"] & Time <= records[1,"End"])
event2 <- subset(df1, Date==records[2,"Date"] & IDNum==records[2,"IDNum"] & Time >= records[2,"Start"] & Time <= records[2,"End"])
event3 <- subset(df1, Date==records[3,"Date"] & IDNum==records[3,"IDNum"] & Time >= records[3,"Start"] & Time <= records[3,"End"])

The result for each event looks like this:

print(event1) #This result is empty
    [1] NewRecNum Date      IDNum     Time      Speed    
    <0 rows> (or 0-length row.names)

print(event2)
            Date  IDNum  Time
80620 04-02-2009 680845 06:35
80621 04-02-2009 680845 07:35
80622 04-02-2009 680845 08:35
80623 04-02-2009 680845 09:35
80624 04-02-2009 680845 10:35

print(event3)
            Date  IDNum  Time
80626 04-02-2009 680845 12:35
80627 04-02-2009 680845 13:35
80628 04-02-2009 680845 14:35
80629 04-02-2009 680845 15:35
80630 04-02-2009 680845 16:35

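My goal is a loop that takes each matching date in match_dates (147 in this case), creates the 147 corresponding records from df2, then uses the Date, IDNum, Start and End values in each row of records to subset df1, and outputs the matching df1 events. What I have so far (which does not work):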

records <- matrix(ncol=6, nrow=nrow(df1)) # Create an empty matrix to start
event=NULL
for (i in 1:length(match_dates)) 
    { records[i] <- df2[which(df2$Date == match_dates[i]), ]

    for (j in 1:nrow(records[i]))
    { event[j] <- subset(df1, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
      }
}
print(event)

Error in 1:nrow(records[i]) : argument of length 0
In addition: Warning message:
In records[i] <- df2[which(df2$Date == match_dates[i]), ] :
  number of items to replace is not a multiple of replacement length
> print(event)
NULL

Thanks in advance for your help! I am banging my head against the wall.

Edit/Update:

I changed records to

records <- subset(df2, Date %in% df1$Date)

and then wrote a function to subset the matching rows of df1:

event_func <- function(df,records,i){
  event_int <- subset(df, Date==records[i,"Date"] & IDNum==records[i,"IDNum"] & Time >= records[i,"Start"] & Time <= records[i,"End"])
  return(event_int)
}

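This function works and outputs what I need. But I am still stuck on a loop that takes the 686 rows of records, matches them against df1, and outputs a final data frame of all the matching df1 rows. I have also tried lapply. This is what I have (neither works):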

# First option using a loop
final <- data.frame()
event_int <- data.frame()

for (i in 1:nrow(records)) {
  event_int[i] <- event_func(df1, records,i)
  final <- rbind(event_int, event_int[i])
}

# Second option using lapply
lapply(records, event_func(df1,records,1:nrow(records)))

Thanks again for your help!

2 Answers:

Answer 0: (score: 1)

There are a few problems here.

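  • records[i] is incorrect; if you want to assign to a row you need records[i,]
  • df2[which(df2$Date == match_dates[i]),] is not guaranteed to have any particular size, and by assigning it to records[i,] inside the loop you are making assumptions about its size. You could assign it to an intermediate value and use another loop to put it into records or, better, call rbind on each iteration of the loop, which removes the need to specify the size of records in advance
  • Trying to assign a data.frame (df2) to a matrix (records) without any conversion is asking for trouble. In any case, records should be a data.frame.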
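
The rbind suggestion in the second bullet could be sketched roughly as follows. This is only an illustration on made-up toy data (the small df1 and df2 below are assumptions, not the asker's data): records starts as an empty data.frame and grows by however many rows each date happens to match.

```r
# Toy stand-ins for the asker's data (hypothetical values).
df1 <- data.frame(Date = c("01-10-2009", "01-11-2009", "01-17-2009"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date  = c("01-10-2009", "01-11-2009", "01-11-2009", "01-12-2009"),
                  Start = c("11:28", "07:37", "15:00", "08:40"),
                  End   = c("20:23", "14:07", "18:00", "15:02"),
                  stringsAsFactors = FALSE)

match_dates <- intersect(df1$Date, df2$Date)

# Start from an empty data.frame and append whatever each date matches;
# no size has to be declared up front, and records stays a data.frame.
records <- data.frame()
for (d in match_dates) {
  records <- rbind(records, df2[df2$Date == d, ])
}
print(records)
```

Growing a data frame with rbind inside a loop is convenient but copies the data on every iteration, so for long inputs it may be worth collecting the pieces in a list and binding them once at the end.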

A simpler approach is to use the match() function through its %in% interface, so

records
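
As a hedged sketch of the %in% idea (not necessarily the exact line the answerer wrote), the whole date-matching loop can collapse into one logical index. The toy df1 and df2 below are assumptions for illustration; the asker's update uses the equivalent subset(df2, Date %in% df1$Date).

```r
# Toy stand-ins for the asker's data (hypothetical values).
df1 <- data.frame(Date = c("01-10-2009", "01-11-2009", "01-17-2009"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date  = c("01-10-2009", "01-11-2009", "01-11-2009", "01-12-2009"),
                  IDNum = c("599091", "599091", "534198", "599091"),
                  stringsAsFactors = FALSE)

# x %in% table is TRUE wherever match(x, table) finds a hit, so this keeps
# every df2 row whose Date appears anywhere in df1, in one vectorised step.
records <- df2[df2$Date %in% df1$Date, ]
print(records)
```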

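Answer 1: (score: 0)

Finally got something working! I ended up changing some of the original code, and found a very helpful answer in another post here.

1) I first defined records by matching the IDs and dates between df1 and df2:
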
records <- subset(df1, IDNum %in% df2$IDNum)
records <- subset(records, Date %in% df2$Date)

# Records looks like:
head(records,5)
               Date  IDNum  Time    Speed
    8653 01-10-2009 534198 02:35 4.001809
    8654 01-10-2009 534198 03:42 4.117383
    8655 01-10-2009 534198 04:49 4.263277
    8656 01-10-2009 534198 05:53 4.310865
    8657 01-10-2009 534198 06:55 4.353049

# df2 looks like:
head(df2)
          Date  IDNum   Trip Replicate Start   End
506 01-04-2009 599091 GL0229         1 12:00 17:21
507 01-05-2009 599091 GL0229         2 08:35 15:08
508 01-05-2009 599091 GL0229         3 15:33 22:44
509 01-06-2009 599091 GL0229         4 08:30 15:12
510 01-06-2009 599091 GL0229         5 15:51 09:06
511 01-07-2009 599091 GL0229         6 10:02 19:16

2) My function subsets records based on the matching ID, date and time window in df2:

event_func <- function(i,...) {
  event_int <- subset(records, Date==df2[i,"Date"] & IDNum==df2[i,"IDNum"] & Time >= df2[i,"Start"] & Time <= df2[i,"End"])
  output <- event_int
  return(output)
}

# For example, subsetting records based on the first row of df2
event_func(1)
            Date  IDNum  Time    Speed
38613 01-04-2009 599091 12:24 1.611527
38614 01-04-2009 599091 15:58 1.545299
38615 01-04-2009 599091 17:02 1.527205
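
One point that may be worth checking in this kind of filter: Time, Start and End are character strings, so >= and <= compare text, not clock times. A small sketch with made-up values suggests this behaves like a time comparison for zero-padded "HH:MM" strings within one day, but a window whose End is earlier than its Start (df2 row 510 shows Start 15:51 and End 09:06) can never match.

```r
# Zero-padded "HH:MM" strings sort in chronological order, so plain
# string comparison acts like a time comparison within a single day.
times <- c("06:35", "12:24", "23:04")
in_window <- times >= "12:00" & times <= "17:21"
print(in_window)

# A window whose End is earlier than its Start cannot be satisfied
# by this test, so such rows always return an empty subset.
crosses <- times >= "15:51" & times <= "09:06"
print(crosses)
```

Whether such windows really run past midnight is an assumption about the data; if they do, they would need separate handling, for example by combining date and time into a single date-time value before comparing.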

3) I repeated event_func over all 686 rows of df2 and used the foreach package to put the results into a single data frame.

library(foreach)
final.match <- foreach(i = 1:nrow(df2), .combine=rbind) %do% {
  event_func(i)}

The output, final.match, is a data frame with 4 columns and 1634 rows, which is exactly what I was looking for!
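
For readers who would rather avoid an extra package, the same combine step can probably be done in base R with lapply and do.call(rbind, ...). The sketch below uses small made-up records and df2 tables (assumptions, not the real data) and a variant of event_func written with plain indexing rather than subset().

```r
# Toy stand-ins (hypothetical values) with the same columns as above.
records <- data.frame(Date  = c("01-04-2009", "01-04-2009", "01-04-2009", "01-05-2009"),
                      IDNum = c("599091", "599091", "599091", "599091"),
                      Time  = c("11:24", "12:24", "15:58", "09:00"),
                      stringsAsFactors = FALSE)
df2 <- data.frame(Date  = c("01-04-2009", "01-05-2009", "01-06-2009"),
                  IDNum = c("599091", "599091", "599091"),
                  Start = c("12:00", "08:35", "08:30"),
                  End   = c("17:21", "15:08", "15:12"),
                  stringsAsFactors = FALSE)

# Return the rows of records that fall inside the i-th window of df2.
event_func <- function(i) {
  w <- df2[i, ]
  records[records$Date == w$Date & records$IDNum == w$IDNum &
            records$Time >= w$Start & records$Time <= w$End, ]
}

# lapply passes one row index at a time and returns a list of data frames
# (some possibly empty); do.call(rbind, ...) stacks them into one.
pieces <- lapply(seq_len(nrow(df2)), event_func)
final.match <- do.call(rbind, pieces)
print(final.match)
```

This also shows why the earlier lapply(records, event_func(...)) attempt failed: lapply needs the function itself plus the vector of indices to iterate over, not the result of calling the function once.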