A fast, efficient way to loop over millions of rows and match columns

Time: 2015-12-27 17:59:49

Tags: r

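I am currently working with eye-tracking data, so I have a huge dataset (think millions of rows) and need a fast way to do the following task. Here is a simplified version of it.

The data tell you where the eye is looking at each time point, for each file being viewed. X1, Y1 are the coordinates of the point being looked at. Each file has multiple time points (the eye looks at different locations in the file over time).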

Filename    Time    X1    Y1
   1         1      10    10
   1         2      12    10

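I also have a second file that lists the items for each filename. Each file contains (in this simplified case) two objects. X1, Y1 are the bottom-left coordinates and X2, Y2 are the top-right coordinates. You can think of this as the bounding box within which each item sits in each file. E.g.: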

Filename    Item    X1   Y1   X2   Y2
  1          Dog    11   10   20   20

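What I want to do is add another column to the first data frame that tells me, for each file and each time point, which object the person is looking at. If no object is being looked at, I want the column to say "none". A point on the border of a box counts as looking at it. For example: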

Filename    Time    X1    Y1   LookingAt
   1         1      10    10    none
   1         2      12    11    Dog
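The border-inclusive rule described above can be written as a small vectorized predicate. This is only a sketch to pin down the rule; the function name `in_box` is made up for illustration and is not part of the original question.

```r
# Hypothetical helper: TRUE where the point (x, y) lies inside or on the
# border of the box with bottom-left (x1, y1) and top-right (x2, y2).
in_box <- function(x, y, x1, y1, x2, y2) {
  x >= x1 & x <= x2 & y >= y1 & y <= y2
}

# The two gaze points from the example, tested against the Dog box
hits <- in_box(x = c(10, 12), y = c(10, 11), x1 = 11, y1 = 10, x2 = 20, y2 = 20)
hits
```

Because the comparison is vectorized, one call checks a whole column of gaze points against a single box without an explicit loop.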

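I know how to do this with a for loop, but it takes forever (and crashes my RStudio). I am wondering whether there is a faster, more efficient way that I am missing.

Here is the dput of the first data frame (it contains more rows than the example shown above):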

structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), Time = structure(c(1L, 
2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", 
"6"), class = "factor"), X1 = structure(c(1L, 4L, 3L, 2L, 1L, 
4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"
), class = "factor"), Y1 = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 
3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), .Names = c("Filename", 
"Time", "X1", "Y1"), row.names = c(NA, -9L), class = "data.frame")

And here is the dput of the second one:

structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", 
"3"), class = "factor"), Item = structure(1:4, .Label = c("Cat", 
"Dog", "House", "Mouse"), class = "factor"), X1 = structure(c(2L, 
4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", 
"13", "35"), class = "factor"), X2 = structure(c(1L, 3L, 
4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", 
"13", "35"), class = "factor")), .Names = c("Filename", "Item", 
"X1", "Y1", "X2", "Y2"), row.names = c(NA, -4L), class = "data.frame")
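Note that both structures above store the coordinates as factors, so calling `as.numeric()` on them directly would return the internal level codes rather than the coordinate values. A minimal sketch of the difference, using a made-up factor `f`:

```r
# A factor whose labels look like numbers (illustrative values only)
f <- factor(c("10", "15", "12"))

codes <- as.numeric(f)                # level codes, not the values
vals  <- as.numeric(as.character(f))  # the actual numbers

codes
vals
```

Going through `as.character()` first is the usual way to recover the numbers before any range comparison is made.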

1 answer:

Answer 0: (score: 4)

Using data.table and the sample data you provided, I would approach it as follows:

library(data.table)

# getting the data in the right format
# ('dat' and 'lu' are the objects defined in the data section at the end of this answer)
datcols <- c("X","Y")
lucols <- c("X1","X2","Y1","Y2")
# factor -> character -> numeric, so we get the values and not the level codes
setDT(dat)[, (datcols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = datcols
           ][, Filename := as.character(Filename)]
setDT(lu)[, (lucols) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = lucols
          ][, `:=` (Filename = as.character(Filename),
                    X1 = pmin(X1,X2), X2 = pmax(X1,X2),   # make sure that 'X1' is always the lowest value
                    Y1 = pmin(Y1,Y2), Y2 = pmax(Y1,Y2))]  # make sure that 'Y1' is always the lowest value

# matching the 'Items' to the correct rows
dat[, looked_at := lu$Item[Filename==lu$Filename &
                      between(X, lu$X1, lu$X2) &
                      between(Y, lu$Y1, lu$Y2)],
    by = .(Filename,Time)]

which gives:

> dat
   Filename Time  X  Y looked_at
1:        1    1 10 10       Cat
2:        1    2 15 20        NA
3:        1    3 12 25        NA
4:        2    1 11 15        NA
5:        2    2 10 10        NA
6:        3    1 15 11        NA
7:        3    2 25 12        NA
8:        3    5 20 15     House
9:        3    6 10 10     Mouse
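The question asks for "none" rather than NA when no item matches. A minimal sketch of that last step follows; the table `res` is a made-up stand-in for the result above, and it assumes `looked_at` may be a factor, which is why it is converted to character first.

```r
library(data.table)

# Stand-in for the result table (illustrative rows only)
res <- data.table(Filename  = c("1", "1", "3"),
                  Time      = c(1, 2, 5),
                  looked_at = factor(c("Cat", NA, "House")))

# Convert to character so that "none" does not have to be an existing level,
# then fill the rows where nothing matched
res[, looked_at := as.character(looked_at)]
res[is.na(looked_at), looked_at := "none"]
res
```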

Data used:

dat <- structure(list(Filename = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1", "2", "3"), class = "factor"), 
                     Time = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 4L, 5L), .Label = c("1", "2", "3", "5", "6"), class = "factor"), 
                     X = structure(c(1L, 4L, 3L, 2L, 1L, 4L, 6L, 5L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor"), 
                     Y = structure(c(1L, 5L, 6L, 4L, 1L, 2L, 3L, 4L, 1L), .Label = c("10", "11", "12", "15", "20", "25"), class = "factor")), 
                .Names = c("Filename", "Time", "X", "Y"), row.names = c(NA, -9L), class = "data.frame")
lu <- structure(list(Filename = structure(c(1L, 1L, 2L, 2L), .Label = c("1", "3"), class = "factor"), 
                     Item = structure(1:4, .Label = c("Cat", "Dog", "House", "Mouse"), class = "factor"), 
                     X1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     X2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "20", "35"), class = "factor"), 
                     Y1 = structure(c(2L, 4L, 3L, 1L), .Label = c("10", "11", "13", "35"), class = "factor"), 
                     Y2 = structure(c(1L, 3L, 4L, 2L), .Label = c("10", "11", "13", "35"), class = "factor")), 
                .Names = c("Filename", "Item", "X1", "X2", "Y1", "Y2"), row.names = c(NA, -4L), class = "data.frame")
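As a possible alternative to the grouped lookup above, the same match can be sketched as a non-equi update join. This is not the method of the answer; it assumes data.table 1.9.8 or later (where non-equi joins in `on` are available) and that the boxes have already been normalised so that X1 <= X2 and Y1 <= Y2. The tables below are the same sample values, typed in directly as numbers.

```r
library(data.table)

# Gaze points (same values as the sample data, already numeric)
dat <- data.table(Filename = c("1","1","1","2","2","3","3","3","3"),
                  Time     = c(1, 2, 3, 1, 2, 1, 2, 5, 6),
                  X        = c(10, 15, 12, 11, 10, 15, 25, 20, 10),
                  Y        = c(10, 20, 25, 15, 10, 11, 12, 15, 10))

# Bounding boxes, with (X1, Y1) bottom-left and (X2, Y2) top-right
lu <- data.table(Filename = c("1", "1", "3", "3"),
                 Item     = c("Cat", "Dog", "House", "Mouse"),
                 X1 = c(10, 20, 20, 10), X2 = c(11, 35, 35, 11),
                 Y1 = c(10, 13, 13, 10), Y2 = c(11, 35, 35, 11))

# For every box in 'lu', update the rows of 'dat' in the same file whose
# point lies inside or on the border of that box
dat[lu, on = .(Filename, X >= X1, X <= X2, Y >= Y1, Y <= Y2),
    looked_at := i.Item]
dat
```

Rows that fall in no box keep NA in `looked_at`. If boxes within a file overlap, a point inside several of them ends up with the item of the last matching row of `lu`, so that case would need its own rule.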