Apply a function and calculations to a set of columns grouped by column name

Time: 2019-06-05 19:12:10

Tags: r

I have some data that looks like this (words in the rows, and IDs with dates in the columns):

         ID1_2008   ID5_2009  ID3_2004   ID3_2001  ID5_2010   ID4_2002  ID3_2003   ID5_2006  ID2_2009   ID4_2010  ID5_2002
words 1 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386
words 2 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071
words 3 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659
words 4 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080
words 5 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653
words 6 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684

I am trying to write a function that extracts all columns belonging to one "ID" and performs some calculations on them.

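That is, the function would take the data for all ID1 columns in the matrix (ID1_2001, ID1_2002, ID1_2003 ... ID1_2010) and perform some calculations, then move on to the next ID, ID2, for each year (ID2_2001, ID2_2002 ... ID2_2010). The function I want to apply is a cosine similarity function, but I am only interested in the cosine similarity between the years of each ID. I am not interested in comparing ID1_2001 with ID9_2017, because they are two different IDs and two different years.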

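I am trying to write a function that does the following:

1) Splits the column names into ID and date.

2) Searches all column names and groups the IDs together.

3) Puts them in a temporary matrix and computes the cosine similarity between the years for that ID.

4) Moves to the next ID and does the same.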
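The four steps above could be sketched in base R roughly as follows. This is only an illustration under assumptions: `cosine_cols()` is a hypothetical helper standing in for the `cosine()` call used in the question, and the sample matrix is rebuilt here with the same column naming so the block runs on its own.

```r
# Hypothetical helper: cosine similarity between the columns of a numeric matrix.
# It stands in for cosine() from the question so the sketch needs only base R.
cosine_cols <- function(m) {
  cp <- crossprod(m)          # all pairwise dot products between columns
  norms <- sqrt(diag(cp))     # Euclidean norm of each column
  cp / outer(norms, norms)
}

# Sample data shaped like the question's: shuffled ID_year column names.
set.seed(1)
cols <- sample(paste(rep(paste0("ID", 1:5), each = 10), 2001:2010, sep = "_"))
mat <- matrix(runif(100 * 50), nrow = 100, ncol = 50,
              dimnames = list(paste("words", 1:100), cols))

# 1) Split each column name and keep the ID part (everything before "_").
ids <- sub("_.*", "", colnames(mat))

# 2) Group the column names by ID; sorting puts the years in order within an ID.
groups <- lapply(split(colnames(mat), ids), sort)

# 3) and 4) For each ID in turn, take its columns as a temporary matrix
#           and compute the year-by-year cosine similarity.
sims <- lapply(groups, function(nms) cosine_cols(mat[, nms, drop = FALSE]))

round(sims$ID1[1:3, 1:3], 3)
```

The result is a named list with one 10 x 10 similarity matrix per ID, so no cross-ID comparison is ever computed.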

It would be nice to store the result in a sparse matrix, or in a matrix filled with zeros, like this:

          ID1_2001 ID1_2002 ... ID1_2010 ... ID9_2001 ID9_2002 ... ID9_2010
ID1_2001   1.0000   0.0384  ...  0.0934  ...  0.0000   0.0000  ...  0.000
ID1_2002   0.0384   1.0000  ...  0.8736  ...  0.0000   0.0000  ...  0.000

...                         ... #(1.000 in the diagonal)

ID9_2001   0.0000   0.0000  ...  0.0000  ...  0.9837  0.7463   ...  0.9376
ID9_2002   0.0000   0.0000  ...  0.0000  ...  0.7463  1.0000   ...  0.2635
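One way to get that zero-filled layout is to preallocate a square matrix of zeros and fill in one block per ID. The sketch below is a minimal base R illustration, not a tuned solution: `cosine_cols()` is a hypothetical stand-in for the question's `cosine()`, and the tiny two-ID data set is made up for the example.

```r
# Hypothetical helper: cosine similarity between the columns of a numeric matrix.
cosine_cols <- function(m) {
  cp <- crossprod(m)
  norms <- sqrt(diag(cp))
  cp / outer(norms, norms)
}

# Small made-up data: two IDs, three years each, columns in shuffled order.
set.seed(2)
cols <- paste(rep(c("ID1", "ID9"), each = 3), 2001:2003, sep = "_")
mat <- matrix(runif(20 * 6), nrow = 20, dimnames = list(NULL, sample(cols)))

# Sort the names so each ID forms one contiguous block, then start from zeros.
ordered <- sort(colnames(mat))
ids <- sub("_.*", "", ordered)
result <- matrix(0, length(ordered), length(ordered),
                 dimnames = list(ordered, ordered))

# Fill only the within-ID blocks; cross-ID cells stay at zero.
for (nms in split(ordered, ids)) {
  result[nms, nms] <- cosine_cols(mat[, nms, drop = FALSE])
}

round(result, 3)
```

With thousands of IDs a dense matrix becomes mostly zeros, so a sparse block-diagonal structure (for example via `bdiag()` from the Matrix package) may be a better fit.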

Alternatively, the data could be stored as a long data frame, like this:

    ID     Score
ID1_2001   0.9423
... 
ID1_2010   0.9277
...
ID9_2001   0.9633
...
ID9_2010   0.1827
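The post does not say which pair of years each single Score in the table above refers to, so the sketch below makes a different, explicit choice: it lists every within-ID pair of columns once, with its similarity. `cosine_cols()` and the column names `col_a` and `col_b` are hypothetical, and the data is made up for the example.

```r
# Hypothetical helper: cosine similarity between the columns of a numeric matrix.
cosine_cols <- function(m) {
  cp <- crossprod(m)
  norms <- sqrt(diag(cp))
  cp / outer(norms, norms)
}

# Small made-up data: two IDs, three years each.
set.seed(3)
cols <- paste(rep(c("ID1", "ID2"), each = 3), 2001:2003, sep = "_")
mat <- matrix(runif(20 * 6), nrow = 20, dimnames = list(NULL, cols))
ids <- sub("_.*", "", colnames(mat))

# One data frame per ID, keeping each unordered pair of years once.
long <- do.call(rbind, lapply(split(colnames(mat), ids), function(nms) {
  sim <- cosine_cols(mat[, nms, drop = FALSE])
  keep <- which(upper.tri(sim), arr.ind = TRUE)   # cells above the diagonal
  data.frame(col_a = rownames(sim)[keep[, "row"]],
             col_b = colnames(sim)[keep[, "col"]],
             Score = sim[keep],
             stringsAsFactors = FALSE)
}))
rownames(long) <- NULL

long
```

Each ID with n years contributes n * (n - 1) / 2 rows, which grows far more slowly than a full matrix across all IDs.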

Data:

library(lsa)  # assumption: cosine() below is lsa::cosine(), which compares the columns of a matrix

# 50 shuffled column names: ID1..ID5 crossed with the years 2001..2010
cols <- sample(c(paste("ID1", 2001:2010, sep = "_"), paste("ID2", 2001:2010, sep = "_"), paste("ID3", 2001:2010, sep = "_"), paste("ID4", 2001:2010, sep = "_"), paste("ID5", 2001:2010, sep = "_")))
rows <- paste("words", 1:100)

# a 100 x 50 matrix needs only 100 * 50 random values
mat <- matrix(data = runif(100 * 50), nrow = 100, ncol = 50, dimnames = list(rows, cols))
mat
# Function I will apply to each "ID"
cosine_dist_mat <- cosine(mat) # here it applies to the whole matrix; however, I wish to apply it to each of the "groups" or "IDs"

Edit:

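I applied the following nesting to the "ID" groups, but I am not sure how to scale it to thousands of IDs. With this nested structure, I think I could apply the cosine function to each nested element with mutate to get the result I am after. Any suggestions here would be great.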

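library(dplyr)   # %>%, group_by()
library(tidyr)   # nest()
library(tibble)  # as_tibble(), rownames_to_column()

nested_results <- mat %>%
  as_tibble() %>%
  t() %>%                                      # rows are now ID_year, columns are words
  data.frame() %>%
  rownames_to_column("Id_col") %>%
  group_by(grp = sub('_.*', '', Id_col)) %>%   # group on the ID part of the name
  nest()                                       # grouped nest() names the list column "data"

nested_results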

Output:

    # A tibble: 5 x 2
      grp   data               
      <chr> <list>             
    1 ID3   <tibble [10 × 101]>
    2 ID1   <tibble [10 × 101]>
    3 ID5   <tibble [10 × 101]>
    4 ID2   <tibble [10 × 101]>
    5 ID4   <tibble [10 × 101]>

    x <- nested_results$data[[3]]

# A tibble: 10 x 101
   Id_col     X1    X2     X3      X4     X5    X6     X7     X8     X9    X10    X11     X12    X13    X14   X15    X16    X17
   <chr>   <dbl> <dbl>  <dbl>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
 1 ID5_2… 0.274  0.839 0.660  0.866   0.238  0.861 0.451  0.359  0.924  0.594  0.241  0.735   0.427  0.688  0.803 0.645  0.870 
 2 ID5_2… 0.0916 0.585 0.548  0.679   0.192  0.841 0.170  0.197  0.910  0.895  0.978  0.263   0.841  0.335  0.344 0.204  0.170 
 3 ID5_2… 0.184  1.000 0.193  0.0191  0.464  0.585 0.822  0.110  0.489  0.0534 0.979  0.118   0.0223 0.342  0.893 0.368  0.992 
 4 ID5_2… 0.587  0.548 0.582  0.246   0.0497 0.891 0.583  0.558  0.532  0.520  0.0732 0.624   0.887  0.0541 0.681 0.0579 0.607 
 5 ID5_2… 0.580  0.533 0.0317 0.245   0.102  0.502 0.299  0.0151 0.0116 0.146  0.298  0.940   0.231  0.278  0.574 0.919  0.841 
 6 ID5_2… 0.275  0.470 0.980  0.958   0.366  0.837 0.889  0.853  0.620  0.219  0.414  0.00662 0.0317 0.927  0.658 0.120  0.270 
 7 ID5_2… 0.269  0.366 0.689  0.0648  0.323  0.208 0.400  0.798  0.872  0.679  0.550  0.843   0.378  0.917  0.577 0.421  0.882 
 8 ID5_2… 0.204  0.523 0.623  0.00725 0.0118 0.758 0.475  0.440  0.0494 0.727  0.783  0.439   0.963  0.133  0.788 0.236  0.462 
 9 ID5_2… 0.165  0.107 0.0527 0.544   0.398  0.977 0.0674 0.335  0.536  0.439  0.956  0.456   0.509  0.121  0.210 0.999  0.0239
10 ID5_2… 0.923  0.507 0.367  0.906   0.850  0.796 0.491  0.725  0.904  0.942  0.411  0.949   0.439  0.739  0.199 0.764  0.673 
# … with 83 more variables: X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>,
#   X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>,
#   X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>, X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
#   X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>, X55 <dbl>, X56 <dbl>, X57 <dbl>, X58 <dbl>, X59 <dbl>,
#   X60 <dbl>, X61 <dbl>, X62 <dbl>, X63 <dbl>, X64 <dbl>, X65 <dbl>, X66 <dbl>, X67 <dbl>, X68 <dbl>, X69 <dbl>, X70 <dbl>,
#   X71 <dbl>, X72 <dbl>, X73 <dbl>, X74 <dbl>, X75 <dbl>, X76 <dbl>, X77 <dbl>, X78 <dbl>, X79 <dbl>, X80 <dbl>, X81 <dbl>,
#   X82 <dbl>, X83 <dbl>, X84 <dbl>, X85 <dbl>, X86 <dbl>, X87 <dbl>, X88 <dbl>, X89 <dbl>, X90 <dbl>, X91 <dbl>, X92 <dbl>,
#   X93 <dbl>, X94 <dbl>, X95 <dbl>, X96 <dbl>, X97 <dbl>, X98 <dbl>, X99 <dbl>, X100 <dbl>
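Given one nested element shaped like `x` above (an `Id_col` column plus one numeric column per word), the per-ID similarity could be computed along these lines. This is a sketch under assumptions: `cosine_cols()` and `sim_from_nested()` are hypothetical helpers, and the small data frame only imitates the shape of `nested_results$data[[3]]`.

```r
# Hypothetical helper: cosine similarity between the columns of a numeric matrix.
cosine_cols <- function(m) {
  cp <- crossprod(m)
  norms <- sqrt(diag(cp))
  cp / outer(norms, norms)
}

# Hypothetical helper: turn one nested element into a year-by-year
# similarity matrix. Rows of `df` are ID_year entries, columns are words.
sim_from_nested <- function(df) {
  m <- t(as.matrix(df[, setdiff(names(df), "Id_col")]))  # words x years
  colnames(m) <- df$Id_col
  cosine_cols(m)
}

# Small stand-in for one element of nested_results$data.
set.seed(4)
x <- data.frame(Id_col = paste("ID5", 2001:2004, sep = "_"),
                matrix(runif(4 * 8), nrow = 4),
                stringsAsFactors = FALSE)

round(sim_from_nested(x), 3)
```

On the nested tibble from the question, something like `lapply(nested_results$data, sim_from_nested)` would apply the same helper to every ID; inside a `mutate()` call, `purrr::map()` over the `data` column should play the same role.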

0 answers:

No answers