Question

I have two data frames:

df_01:
id  n1  n2  n3  n4  n5  n6
1   1   2   3   4   5   6
2   6   5   4   3   2   1
... (2000000 rows)

df_02:
m1  m2  m3  m4  m5
1   2   3   4   5
5   4   3   2   1
... (1200 rows)

For each row of df_01 (df_01[x, 2:7]), I now need to count how many of its values appear in each row of df_02 (df_02[x, ]), and store that count somewhere. Like this:

df_01:
id  n1  n2  n3  n4  n5  n6  df02.r1  df02.r2
1   1   2   3   4   5   6   5        2       #one column for each row from df_02
2   6   5   4   3   2   9   4        3
... (2000000 rows)

df_02:
m1  m2  m3  m4  m5
1   2   3   4   5
5   6   7   8   9
... (1200 rows)

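Right now I use a for loop to go through the rows of df_01 and a while loop to check each one against df_02, storing the intersection counts and appending them to each row of df_01.

A condensed version of the code: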

rows <- nrow(df_02)
for (id in df_01$id) {
  df_01_row <- df_01[df_01$id == id, ]  # row for the current id
  new_row_count <- data.frame(r1 = 0)
  actual_row <- 1 # In the real code this value is computed (the last df_02 row already processed): df_02 will receive more rows, and this function will also be used to process those updates.
  while (actual_row <= rows) {
    new_row_count[, paste0("r", actual_row)] <- length(base::intersect(df_01_row[, 2:7], df_02[actual_row,]))
    # base::intersect running faster than dplyr::intersect in this case
    actual_row <- actual_row + 1
  }
  # append new_row_count to df_01 in database
}

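This is a very long operation. I'm using two computers, one for the odd rows and one for the even rows of df_01, plus a shared database (R mongolite) that stores all the computed results. I'm using a database because I need to keep the results for future reference, and the job takes days to complete.

I'm looking for ways to make this more efficient (data architecture changes, packages, anything). Any suggestions are welcome.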

Answer 1

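Here is a possible solution using the purrr package.

I added a custom function, counter(), that counts how many values of a vector occur in each row of a data frame (a slightly different approach from using intersect()).

purrr::by_row() is used to perform the row-wise iteration.

I can't say for sure how this will scale to the number of rows you have to deal with, but it may be worth a try!

Apart from that, I made a small tweak to df_01 so that the result differs for each row (before, they appeared to be identical).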

df_01 <- read.table(text="id  n1  n2  n3  n4  n5  n6
1   1   2   3   4   5   6
2   6   5   8   3   2   1", header=T)

df_02 <- read.table(text="m1  m2  m3  m4  m5
1   2   3   4   5
5   6   7   8   9", header=T)

library(purrr)  # in newer purrr releases by_row() lives in the purrrlyr package; use library(purrrlyr) there
counter <- function(vals, df) {
  by_row(df, ~ sum(vals %in% .), .collate = "cols")$.out
}

x <- by_row(df_01[, -1], counter, df_02, .collate = "cols")
x
#>   n1 n2 n3 n4 n5 n6 .out1 .out2
#> 1  1  2  3  4  5  6     5     2
#> 2  6  5  8  3  2  1     4     3

# Then rename the columns
colnames(x) <- sub("\\.out", "df02.r", colnames(x))
x
#>   n1 n2 n3 n4 n5 n6 df02.r1 df02.r2
#> 1  1  2  3  4  5  6       5       2
#> 2  6  5  8  3  2  1       4       3
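
For comparison, the same %in% counting can be sketched in base R without by_row(). This is only a sketch under the same assumptions as above: count_matches() is a hypothetical helper name, not part of any package, and like sum(vals %in% .) it counts a value twice if it is repeated within a df_01 row.

```r
df_01 <- read.table(text="id  n1  n2  n3  n4  n5  n6
1   1   2   3   4   5   6
2   6   5   8   3   2   1", header=TRUE)

df_02 <- read.table(text="m1  m2  m3  m4  m5
1   2   3   4   5
5   6   7   8   9", header=TRUE)

# Hypothetical helper: for every row of a and every row of b,
# count how many values of the a-row occur in the b-row.
count_matches <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  # For one row of b: test all of a at once, then sum per row of a
  counts <- apply(b, 1, function(b_row) {
    rowSums(matrix(a %in% b_row, nrow = nrow(a)))
  })
  # apply() drops to a vector when a has a single row, so restore the shape
  counts <- matrix(counts, nrow = nrow(a))
  colnames(counts) <- paste0("df02.r", seq_len(nrow(b)))
  counts
}

res <- cbind(df_01, count_matches(df_01[, -1], df_02))
res
```

This still loops over the rows of df_02, but each iteration is vectorised over all rows of df_01, so it may be a reasonable middle ground when extra packages are not an option.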

Answer 2

Here is another idea that should be more efficient, although it uses more memory:

Storing the "data.frame"s as matrices should be more convenient:

m1 = as.matrix(df1[, -1]); m2 = as.matrix(df2) 

m1
#     n1 n2 n3 n4 n5 n6
#[1,]  1  2  3  4  5  6
#[2,]  6  5  4  3  2  9
m2
#  m1 m2 m3 m4 m5
#1  1  2  3  4  5
#2  5  6  7  8  9
#3  1  3  2  5  8

Find all unique values:

lvs = union(m1, m2)

and tabulate them in a sparse matrix (since no row contains duplicates, nothing gets counted twice, so we can use a "logical" matrix):

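library(Matrix)  # provides sparseMatrix() and the sparse tcrossprod() method
# Index columns by position in lvs so both tables share the same column layout
tab1 = sparseMatrix(i = row(m1), j = match(m1, lvs), x = TRUE, dims = c(nrow(m1), length(lvs)))
tab2 = sparseMatrix(i = row(m2), j = match(m2, lvs), x = TRUE, dims = c(nrow(m2), length(lvs)))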
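
To see why tabulating helps, here is a small dense illustration of the same idea in base R (a sketch only; dense_indicator() is a hypothetical helper, and a dense matrix like this would not be practical at 2,000,000 rows). Each row becomes a 0/1 vector over the unique values, and the dot product of two such vectors is the number of values the two rows share.

```r
m1 <- rbind(c(1, 2, 3, 4, 5, 6), c(6, 5, 4, 3, 2, 9))
m2 <- rbind(c(1, 2, 3, 4, 5), c(5, 6, 7, 8, 9), c(1, 3, 2, 5, 8))
lvs <- sort(union(m1, m2))  # the unique values, here 1..9

# Hypothetical helper: entry [r, k] is 1 when lvs[k] occurs in row r of m
dense_indicator <- function(m, lvs) {
  1 * t(apply(m, 1, function(r) lvs %in% r))
}

ind1 <- dense_indicator(m1, lvs)
ind2 <- dense_indicator(m2, lvs)

ind1[1, ]
#> [1] 1 1 1 1 1 1 0 0 0

# Row-by-row dot products: shared values between each m1 row and each m2 row
shared <- ind1 %*% t(ind2)
shared
#>      [,1] [,2] [,3]
#> [1,]    5    2    4
#> [2,]    4    3    3
```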

Then:

tcrossprod(tab1, tab2)
#2 x 3 sparse Matrix of class "dgCMatrix"
#          
#[1,] 5 2 4
#[2,] 4 3 3

The result stores the nrow(df1) * nrow(df2) intersection counts: one row per row of df1 and one column per row of df2.

The data are:
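
With 2,000,000 x 1,200 rows the full result has 2.4 billion entries, so one option is to process df1 in blocks of rows and save each block before moving on. The following is a minimal sketch, assuming the Matrix package is installed and that no row contains duplicate values; sparse_indicator(), count_in_chunks() and chunk_size are hypothetical names introduced for illustration.

```r
library(Matrix)

# Hypothetical helper: sparse 0/1 table of which values occur in each row of m
sparse_indicator <- function(m, lvs) {
  sparseMatrix(i = c(row(m)), j = match(c(m), lvs), x = 1,
               dims = c(nrow(m), length(lvs)))
}

# Hypothetical helper: returns a list with one dense count matrix per block of m1 rows
count_in_chunks <- function(m1, m2, chunk_size = 100000) {
  lvs <- union(m1, m2)
  tab2 <- sparse_indicator(m2, lvs)  # built once, reused for every block
  starts <- seq(1, nrow(m1), by = chunk_size)
  lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(m1))
    tab1 <- sparse_indicator(m1[rows, , drop = FALSE], lvs)
    # In a real run, this is where a block could be written to the database
    as.matrix(tcrossprod(tab1, tab2))
  })
}

m1 <- rbind(c(1, 2, 3, 4, 5, 6), c(6, 5, 4, 3, 2, 9))
m2 <- rbind(c(1, 2, 3, 4, 5), c(5, 6, 7, 8, 9), c(1, 3, 2, 5, 8))
blocks <- count_in_chunks(m1, m2, chunk_size = 1)
res <- do.call(rbind, blocks)
```

The same function could also serve the update case from the question: when df_02 receives new rows, passing only those rows as m2 yields just the additional columns.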

df1 = structure(list(id = 1:2, n1 = c(1L, 6L), n2 = c(2L, 5L), n3 = 3:4, 
n4 = c(4L, 3L), n5 = c(5L, 2L), n6 = c(6, 9)), .Names = c("id", 
"n1", "n2", "n3", "n4", "n5", "n6"), row.names = c(NA, -2L), class = "data.frame")


df2 = structure(list(m1 = c(1, 5, 1), m2 = c(2, 6, 3), m3 = c(3, 7, 
2), m4 = c(4, 8, 5), m5 = c(5, 9, 8)), .Names = c("m1", "m2", 
"m3", "m4", "m5"), row.names = c(NA, 3L), class = "data.frame")

Check and count intersecting column values between data frames