Question

I have these two data frames:

library(data.table)

set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace=TRUE), 
            hash = sample(letters[1:5], 1000, replace=TRUE),
            point = sample(1:24, 1000, replace=TRUE))

B <- data.table(station = sample(1:10, 100, replace=TRUE), 
            card = sample(letters[6:10], 100, replace=TRUE),
            point = sample(1:24, 100, replace=TRUE))

Data frame A contains more than 1M rows.

I am trying to find the hash (from A) for each card (from B). There are some conditions: the stations and points in A must fall within a range (station ± 1, and for point only + 2).

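I group B by card, and for each group a function row-binds the rows of A that meet these conditions and then takes the hash with the maximum frequency.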

detect <- function(x){
  am0 <- data.frame(station = 0,
                    hash = 0, 
                    point = 0)
  for (i in 1:nrow(x)) {
    am1 <- A %>%
      filter(station %in% (B$station[i] - 1) : (B$station[i] + 1) &
               point > B$point[i] & point < B$point[i] + 2)
    am0 <- rbind(am0, am1)
  }
  t <- as.data.frame(table(am0$hash))
  t <- t %>%
    arrange(-Freq) %>%
    filter(row_number() == 1)
  return(t)
}

Then I simply run:

library(dplyr)
B %>% 
  group_by(card) %>%
  do(detect(.)) %>%
  ungroup

But I don't know how to use the index [i] within each group, so I actually get wrong results.

# A tibble: 5 x 3
   card   Var1  Freq
  <chr> <fctr> <int>
1     f      c    46
2     g      c    75
3     h      c    41
4     i      c    64
5     j      c    62

I'm a beginner, but I know that the best approach for large datasets is to use the data.table library to join them. Can you help me find a solution?

Answer 1

I think what you want to do is:

#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]

## Join A on B; this creates all combinations of rows in A and B that fulfil the conditions
joined_table <- B[A,
  on = .(point_limit >= point, point <= point,
         station_lower <= station, station_upper >= station),
  nomatch = 0,
  allow.cartesian = TRUE]


## Count the occurrences of the combinations
counted_table  <- joined_table[,.N, by=.(card,hash)][order(card, -N)]

## Select the top for each group. 
counted_table[, head(.SD, 1 ),by = .(card)][order(card)]

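This creates a complete table with all the information and then counts it. It relies entirely on data.table, so it takes full advantage of that package's speed. If you are not familiar with the syntax, the data.table vignettes are a good place to start. The nomatch condition ensures that we are doing an inner join.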
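To make the join conditions easier to follow, here is a minimal sketch on toy data. The tables A_toy and B_toy and their values are made up for illustration and are not part of the question's data; the join itself uses the same limits as above and is compared against a plain filter.

```r
library(data.table)

# Toy data (hypothetical values, chosen only for illustration)
A_toy <- data.table(station = c(1L, 2L, 5L),
                    hash    = c("a", "b", "c"),
                    point   = c(10L, 12L, 11L))
B_toy <- data.table(station = 2L, card = "f", point = 10L)

# Same join limits as in the answer
B_toy[, point_limit   := as.integer(point + 2)]
B_toy[, station_lower := as.integer(station - 1)]
B_toy[, station_upper := as.integer(station + 1)]

# Non-equi inner join: in each condition the left-hand name is a column
# of B_toy and the right-hand name is a column of A_toy
joined_toy <- B_toy[A_toy,
                    on = .(point_limit >= point, point <= point,
                           station_lower <= station, station_upper >= station),
                    nomatch = 0,
                    allow.cartesian = TRUE]

# Brute-force filter expressing the same ranges for the single row of B_toy:
# station in [1, 3] and point in [10, 12]
brute <- A_toy[station >= 1L & station <= 3L & point >= 10L & point <= 12L]

joined_toy[, .(card, hash)]
```

Only the rows with hash a and b fall inside both ranges; the row with hash c is dropped because station 5 lies outside [1, 3]. Note that these limits are inclusive at both ends.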

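If A has only 1M rows and B stays the same size, this will probably be fine, depending on how your data is distributed. However, we can use the purrr package to split B up in a way similar to your do statement. I am not sure how this interacts with R's garbage collection.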

frame_list <- purrr::map(unique(B$card),
            ~ B[card == .x][A,
                            on = .(point_limit >= point,
                                   point <= point,
                                   station_lower <= station,
                                   station_upper >= station),
                            nomatch = 0,
                            allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list)

Note that I use rbindlist here rather than multiple rbind calls. Calling rbind repeatedly is very slow, because new memory has to be allocated each time.
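As a rough illustration of that difference, the sketch below builds a small list of per-group tables and combines them both ways. The list pieces is a hypothetical stand-in for frame_list, with made-up values.

```r
library(data.table)

# Hypothetical per-group results, standing in for the elements of frame_list
pieces <- lapply(1:3, function(i) {
  data.table(card = letters[5 + i], hash = "a", N = i)
})

# rbindlist combines the whole list in a single call
combined <- rbindlist(pieces)

# Growing the result with rbind gives the same table,
# but the accumulated table is copied on every iteration
grown <- pieces[[1]]
for (p in pieces[-1]) {
  grown <- rbind(grown, p)
}

all.equal(combined, grown)
```

With three small tables the cost is negligible; the repeated copying only starts to matter when the list is long or the pieces are large.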
