Combining information from two data frames with dplyr

Date: 2015-04-13 15:00:29

Tags: r, data.table, dplyr

I need some help with dplyr. I have two data frames: a huge one containing several time series A, B, ... (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).

I would like to add another column, left_boundary, to LargeDF, containing the appropriate boundary value, like this:

LargeDF
   ts timestamp   signal     # left_boundary
1   A 0.3209338 10.43279     # 0
2   A 1.4791524 10.34295     # 1
3   A 2.6007494 10.71601     # 2

Categories
   ts left right
1   A    0     1
2   A    1     2
3   A    2     3

The code I came up with is

LargeDF %>%
  group_by(ts) %>%
  do(myFUN(., Categories))

# which calls this ...
myFUN <- function(Large, Categ) {
  CategTS <- Categ %>%
    filter(ts == Large[1, "ts"][[1]])

  Large %>%
    group_by(timestamp) %>%   # this is bothering me ...
    mutate(left_boundary = CategTS$left[CategTS$left < timestamp
                                        & timestamp < CategTS$right])
}

but it is super slow for large time series. I would really like to get rid of the group_by(timestamp), since the timestamps are unique within each ts anyway.

Does anyone see a better solution? Thanks a lot.
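Since the timestamps are sorted and the intervals of a series do not overlap, the per-row scan over all intervals can in principle be replaced by a binary search over the left boundaries. The sketch below illustrates that idea in Python with a hypothetical stdlib-only helper (using half-open intervals for definiteness; it is not the dplyr code itself):

```python
from bisect import bisect_right

def left_boundaries(timestamps, lefts, rights):
    """For each timestamp, return the left boundary of the interval
    [left, right) containing it, or None if nothing matches.
    Assumes intervals are sorted and non-overlapping."""
    out = []
    for t in timestamps:
        i = bisect_right(lefts, t) - 1       # last interval with left <= t
        out.append(lefts[i] if i >= 0 and t < rights[i] else None)
    return out

# Boundaries taken from the question's Categories table (series A)
print(left_boundaries([0.32, 1.48, 2.60], [0, 1, 2], [1, 2, 3]))
# [0, 1, 2]
```

In R, the analogous tool is findInterval(timestamp, CategTS$left), which avoids the per-timestamp grouping entirely.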

Update (data.table and my slightly adapted mock example)

So, I first tried @DavidArenburg's suggestion on a quick/dirty mock example, but ran into trouble: some timestamps were binned twice (into consecutive categories/intervals).

# Code for making the example data frames ...
library("dplyr")
n <- 10; series <- c("A", "B", "C")
LargeDF <- data.frame(
    ts        = rep(series, each = n)
  , timestamp = runif(n*length(series), max = 4)
  , signal    = runif(n*length(series), min = 10, max = 11)
) %>% group_by(ts) %>% arrange(timestamp)

m <- 7
Categories <- data.frame(
    ts    = rep(series, each = m)
  , left  = rep(seq(1 : m) - 1, length(series))
  , right = rep(seq(1 : m), length(series))
)
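One generic way to get a timestamp binned twice is to have consecutive intervals that are closed at both ends: a point on a shared boundary then matches both neighbours. foverlaps's behaviour in the run below additionally involves its minoverlap argument, so this pure-Python sketch is only an analogy, not its exact semantics:

```python
# Consecutive intervals that are closed on both ends share their
# boundary points, so a point on a shared edge matches twice.
intervals = [(0.9, 1.9), (1.9, 2.9), (2.9, 9.9)]

def matches(t, ivs):
    """Return every interval [l, r] (closed on both ends) containing t."""
    return [(l, r) for (l, r) in ivs if l <= t <= r]

print(matches(1.9, intervals))  # two hits: (0.9, 1.9) and (1.9, 2.9)
print(matches(1.0, intervals))  # one hit:  (0.9, 1.9)
```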

## Code for my data.table mock example -----
library(data.table)
n <- 1
d <- data.table(value      = runif(9),
                timestamp  = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n,
                timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n)
c <- data.table(left  = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n,
                right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n)
setkey(c, left, right)

> foverlaps(d, c, type = "any", by.x = c("timestamp", "timestamp2"))
    left right     value timestamp timestamp2
 1:  0.9   1.9 0.1885459         1          1
 2:  0.9   1.9 0.0542375         2          2  # binned here
 3:  1.9   2.9 0.0542375         2          2  # and here as well
13: 19.9  25.9 0.4579986        20         20

Then I took into account that minoverlap = 1L is the default and realized that a normal timestamp is >> 1:

> as.numeric(Sys.time())
[1] 1429022267

So if I scale everything up to larger values (e.g. n <- 10 in the example above), it all works out:

   left right      value timestamp timestamp2
1:    9    19 0.64971126        10         10
2:   19    29 0.75994751        20         20
3:   29    99 0.98276462        30         30
9:  199   259 0.89816165       200        200

With my real data everything went smoothly as well; thanks again.

Update 2 (join, then filter, in dplyr)

I also tested the suggestion from @aosmith to use the dplyr verbs left_join() and filter(): first create one (very) large data frame by joining, then filter it down again. Pretty quickly I ran into memory problems:

Error: std::bad_alloc

For smaller tables this approach might be a good idea, since the syntax is very nice (but that, again, is personal preference). In this case I would go with the data.table solution. Thanks again for all the suggestions.
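The std::bad_alloc is unsurprising once you count rows: a join on ts alone pairs every observation of a series with every interval of that series before filter() throws most pairs away. A back-of-the-envelope sketch in Python, with made-up sizes:

```python
# Join-then-filter first materializes the full pairing of rows and
# intervals within each series, then keeps only the matching pairs.
n_rows_per_series = 1_000_000   # hypothetical sizes
n_intervals       = 1_000
n_series          = 3

joined_rows = n_series * n_rows_per_series * n_intervals  # before filter()
kept_rows   = n_series * n_rows_per_series                # one interval per row

print(f"{joined_rows:,} intermediate rows -> {kept_rows:,} kept")
```

An interval join (foverlaps) never materializes the non-matching pairs, which is why it stays within memory.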

1 Answer:

Answer 0 (score: 5)

dplyr is not well suited for such operations; try data.table's foverlaps function instead:

library(data.table)
class(LargeDF) <- "data.frame" ## Removing all the dplyr classes
setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table
setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work)
LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary 
#    ts  timestamp   signal       left      right left_boundary
# 1:  A 0.46771516 10.72175 0.46771516 0.46771516             0
# 2:  A 0.58841492 10.35459 0.58841492 0.58841492             0
# 3:  A 1.14494484 10.50301 1.14494484 1.14494484             1
# 4:  A 1.18298225 10.82431 1.18298225 1.18298225             1
# 5:  A 1.69822678 10.04780 1.69822678 1.69822678             1
# 6:  A 1.83189609 10.75001 1.83189609 1.83189609             1
# 7:  A 1.90947475 10.94715 1.90947475 1.90947475             1
# 8:  A 2.73305266 10.14449 2.73305266 2.73305266             2
# 9:  A 3.02371968 10.17724 3.02371968 3.02371968             3
# ...
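For readers outside R, the semantics of the answer's trick (treat each timestamp as the degenerate interval [t, t] and overlap-join it against Categories) can be sketched in a few lines of Python. This is a brute-force illustration of what foverlaps computes, not of its keyed binary search, and the interval data is a shortened, hypothetical slice of the example:

```python
# Each timestamp t is treated as the zero-width interval [t, t] and
# matched against every category interval [l, r] it overlaps
# (closed ends, as in foverlaps).
def overlap_join(points, intervals):
    return [(t, l) for t in points for (l, r) in intervals if l <= t <= r]

categories = [(0, 1), (1, 2), (2, 3), (3, 4)]   # series A, first bins
timestamps = [0.46, 1.14, 2.73, 3.02]
print(overlap_join(timestamps, categories))
# [(0.46, 0), (1.14, 1), (2.73, 2), (3.02, 3)]
```

The second element of each pair is exactly the left_boundary column the question asked for; note that a timestamp landing exactly on a shared boundary would match two intervals, mirroring the double-binning discussed in the first update.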