Question

The dataset has the following structure:

    Key         Date         Mat    Amount
     <int>     <date>       <chr>  <dbl>
1  1001056    2014-12-12    10025  0.10
2  1001056    2014-12-23    10025  0.20
3  1001056    2015-01-08    10025  0.10
4  1001056    2015-04-07    10025  0.20
5  1001056    2015-05-08    10025  0.20
6  1001076    2013-10-29    10026  3.00
7  1001140    2013-01-18    10026  0.72
8  1001140    2013-04-11    10026  2.40
9  1001140    2014-10-08    10026  0.24
10 1001237    2015-02-17    10025  2.40
11 1001237    2015-02-17    10026  3.40

The values of Mat lie in {10001, ..., 11000}, so A := |Mat| = 1000.

I would like to achieve the following:

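1) (Intermediate step) For each Key-Date combination, I would like to compute the difference in Amount between every pair of materials available for that combination (the set of materials may differ from key to key). For example, for the combination "1001237 2015-02-17" and the materials 10025 and 10026, this would be 2.40 - 3.40 = -1 (but there may be many more pairs). (How can these values be stored efficiently?) This step could possibly be skipped.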

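2) Finally, I would like to build a new matrix of dimension A = 1000 (that is, A x A), in which each entry (i, j) (the combination of materials i and j) contains the mean of the values computed in the previous step. More formally, entry (i, j) is given by

1 / |{Key-Date combinations containing Mat i and Mat j}| * \sum_{Key-Date combinations containing Mat i and Mat j} (Amount_i - Amount_j)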

Since the table is quite large, computational efficiency is very important.

Thank you very much in advance for your help!

Answer 1

I can do this using list-columns from the tidyverse; the trick is to use group_by to get the distinct combinations of Key and Date. Here is the code:

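library(dplyr)
library(tidyr)

# Mat is a character column, so its values can be used directly as dimnames
materials <- unique(x$Mat)
n <- length(materials)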

x <- x %>% 
  group_by(Key, Date) %>% 
  nest() %>% 
  # Create a n by n matrix for each combination of Key and Date
  mutate(matrices = lapply(data, 
                       function(y) {
                         out <- matrix(nrow = n, ncol = n, 
                                       dimnames = list(materials, materials))
                         # Only fill in when the pair of materials is present
                         # for the date of interest
                         mat_present <- as.character(unique(y$Mat))
                         for (i in mat_present) {
                           for (j in mat_present) {
                             # You may want to take an absolute value
                             out[i,j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
                           }
                         }
                         out
                       }))
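The double loop inside the function can probably be vectorised in plain R before reaching for compiled code. The following is a minimal sketch for a single Key-Date group, not part of the original answer: the data frame `y`, the extra material "10027", and the assumption that each material occurs at most once per group are all made up for illustration.

```r
# Hypothetical single Key-Date group (values taken from the question's example)
y <- data.frame(Mat = c("10025", "10026"),
                Amount = c(2.40, 3.40),
                stringsAsFactors = FALSE)

# Hypothetical full set of materials; "10027" is absent from this group
materials <- c("10025", "10026", "10027")

# Start with an all-NA matrix, as in the loop version
out <- matrix(NA_real_,
              nrow = length(materials), ncol = length(materials),
              dimnames = list(materials, materials))

# outer() builds every pairwise difference Amount_i - Amount_j at once;
# indexing by name writes them into the rows/columns of the materials present.
# Assumes each material appears at most once in the group.
out[y$Mat, y$Mat] <- outer(y$Amount, y$Amount, "-")

out
```

Rows and columns for materials missing from the group stay NA, so the later averaging step with na.rm = TRUE should behave the same way as with the loop.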

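If you really want to improve the speed, you can implement the function passed to lapply in Rcpp. You can use RcppParallel to speed it up further. One of the columns of the data frame is now a list of matrices. Then, for each element of the matrix, take the mean while ignoring the NAs: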

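# Stack the per-group matrices into an n x n x (number of Key-Date groups) array;
# for the sample data above this is 2 x 2 x 10
x_arr <- array(unlist(x$matrices), dim = c(n, n, nrow(x)))
# For each (i, j), average over the third dimension, ignoring NAs
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)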

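I stacked the list of matrices into a 3D array and took the row means slice by slice. For performance, you could also do this in RcppArmadillo with sum(x_arr, 2), but the NAs are hard to deal with when not all materials are present in each combination of Key and Date.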
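With A = 1000, one A x A matrix per Key-Date group may use a lot of memory. A different technique from the one above, sketched here in base R as an assumption rather than as part of the original answer, is to self-join the table on Key and Date and average the differences per material pair. The names `pairs`, `diff`, and `avg` are made up for illustration, and the data are the sample rows from the question.

```r
# Sample rows copied from the question
x <- data.frame(
  Key = c(rep(1001056L, 5), 1001076L, rep(1001140L, 3), rep(1001237L, 2)),
  Date = as.Date(c("2014-12-12", "2014-12-23", "2015-01-08", "2015-04-07",
                   "2015-05-08", "2013-10-29", "2013-01-18", "2013-04-11",
                   "2014-10-08", "2015-02-17", "2015-02-17")),
  Mat = c(rep("10025", 5), rep("10026", 4), "10025", "10026"),
  Amount = c(0.10, 0.20, 0.10, 0.20, 0.20, 3.00, 0.72, 2.40, 0.24, 2.40, 3.40),
  stringsAsFactors = FALSE
)

# Self-join: every pair of rows that share the same Key and Date
pairs <- merge(x, x, by = c("Key", "Date"), suffixes = c("_i", "_j"))

# Step 1 of the question: Amount_i - Amount_j for each pair
pairs$diff <- pairs$Amount_i - pairs$Amount_j

# Step 2: mean difference per (Mat i, Mat j); fixing the factor levels
# yields a full square matrix, with NA for pairs that never occur together
materials <- sort(unique(x$Mat))
avg <- tapply(pairs$diff,
              list(factor(pairs$Mat_i, levels = materials),
                   factor(pairs$Mat_j, levels = materials)),
              mean)

avg
```

The intermediate table grows with the square of the number of materials per Key-Date group, so whether this is faster than the per-group matrices depends on how many materials typically share a group.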
