I have some data that looks like this (words in the rows, ID-and-date combinations in the columns):
ID1_2008 ID5_2009 ID3_2004 ID3_2001 ID5_2010 ID4_2002 ID3_2003 ID5_2006 ID2_2009 ID4_2010 ID5_2002
words 1 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386 0.54252808 0.5507386
words 2 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071 0.65482939 0.6746071
words 3 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659 0.09556285 0.6375659
words 4 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080 0.89386178 0.9428080
words 5 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653 0.90967204 0.5492653
words 6 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684 0.99719264 0.2848684
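I am trying to write a function that extracts all the columns belonging to a given "ID" and performs some calculations on them.
That is, the function would pick out every ID1 column from the matrix's column names (ID1_2001, ID1_2002, ID1_2003, ..., ID1_2010), perform the calculations, and then move on to the next ID, ID2, across all of its years (ID2_2001, ID2_2002, ..., ID2_2010).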
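The function I want to apply is the cosine similarity function, but I am only interested in the cosine similarity between the years within each ID. I am not interested in comparing ID1_2001 with ID9_2017, since they belong to two different IDs and two different years.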
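I am trying to write a function that does the following:
1) Splits the column names into ID and date.
2) Searches all the column names and groups the columns by ID.
3) Puts each group in a temporary matrix and computes the cosine similarity between the years for that ID.
4) Moves on to the next ID and does the same thing.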
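The four steps above could be sketched in base R roughly as follows. This is only a sketch under stated assumptions: cosine_cols() is a hypothetical helper that computes column-wise cosine similarity by hand (standing in for whichever cosine() implementation is actually used), and the small matrix is made-up toy data, not the real data.

```r
# Toy data: 5 words x 6 columns named "<ID>_<year>" (made-up values)
set.seed(42)
cols <- c("ID1_2001", "ID2_2001", "ID1_2002", "ID2_2003", "ID1_2003", "ID2_2002")
mat <- matrix(runif(5 * length(cols)), nrow = 5,
              dimnames = list(paste("words", 1:5), cols))

# Hypothetical helper: cosine similarity between every pair of columns of m
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# 1) Split the column names into ID and date
ids   <- sub("_.*$", "", colnames(mat))
years <- sub("^.*_", "", colnames(mat))

# 2) Group the column names by ID, ordered by year within each ID
#    (years are compared as strings, which is fine for 4-digit years)
o <- order(ids, years)
col_groups <- split(colnames(mat)[o], ids[o])

# 3) + 4) For each ID in turn, pull its columns into a temporary matrix
#         and compute the year-by-year cosine similarity
sims_by_id <- lapply(col_groups, function(cn) cosine_cols(mat[, cn, drop = FALSE]))

sims_by_id$ID1  # 3 x 3 matrix for ID1_2001, ID1_2002, ID1_2003
```

Because the result is a named list with one small matrix per ID, the same code should run unchanged whether there are five IDs or several thousand.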
It would be cool to store the result in a sparse matrix, or in a matrix filled with zeros, like the following:
          ID1_2001  ID1_2002  ...  ID1_2010  ...  ID9_2001  ID9_2002  ...  ID9_2010
ID1_2001    1.0000    0.0384  ...    0.0934  ...    0.0000    0.0000  ...    0.0000
ID1_2002    0.0384    1.0000  ...    0.8736  ...    0.0000    0.0000  ...    0.0000
...         ...                                              # (1.0000 on the diagonal)
ID9_2001    0.0000    0.0000  ...    0.0000  ...    1.0000    0.7463  ...    0.9376
ID9_2002    0.0000    0.0000  ...    0.0000  ...    0.7463    1.0000  ...    0.2635
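A zero-filled matrix of this shape could be built by starting from an all-zero square matrix and filling in one block per ID. The sketch below uses made-up toy data and a hypothetical hand-written cosine_cols() helper; neither is taken from the real data.

```r
# Toy data: 5 words x 4 columns (made-up values)
set.seed(42)
cols <- c("ID1_2001", "ID2_2001", "ID1_2002", "ID2_2002")
mat <- matrix(runif(5 * length(cols)), nrow = 5,
              dimnames = list(paste("words", 1:5), cols))

# Hypothetical helper: cosine similarity between every pair of columns of m
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# Reorder the columns so that each ID's years sit next to each other
ids <- sub("_.*$", "", colnames(mat))
o   <- order(ids, colnames(mat))
mat <- mat[, o]
ids <- ids[o]

# Start from zeros, then fill only the within-ID blocks
result <- matrix(0, nrow = ncol(mat), ncol = ncol(mat),
                 dimnames = list(colnames(mat), colnames(mat)))
for (id in unique(ids)) {
  idx <- which(ids == id)
  result[idx, idx] <- cosine_cols(mat[, idx, drop = FALSE])
}

round(result, 4)
```

If the Matrix package is available, Matrix::Matrix(result, sparse = TRUE) should convert this to a sparse representation. With thousands of IDs it would probably be better to avoid the dense intermediate altogether, for example by passing the per-ID blocks to Matrix::bdiag().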
Alternatively, the data could be stored as a long data frame, like the following:
ID         Score
ID1_2001   0.9423
...
ID1_2010   0.9277
...
ID9_2001   0.9633
...
ID9_2010   0.1827
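The layout above does not say which pair of years each Score refers to, so the sketch below assumes one row per within-ID pair of columns. The column names col_a and col_b, the cosine_cols() helper, and the toy data are all hypothetical.

```r
# Toy data: 5 words x 6 columns (made-up values)
set.seed(42)
cols <- c("ID1_2001", "ID1_2002", "ID1_2003", "ID2_2001", "ID2_2002", "ID2_2003")
mat <- matrix(runif(5 * length(cols)), nrow = 5,
              dimnames = list(paste("words", 1:5), cols))

# Hypothetical helper: cosine similarity between every pair of columns of m
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

ids <- sub("_.*$", "", colnames(mat))

# One data frame per ID, stacked with rbind()
long <- do.call(rbind, lapply(unique(ids), function(id) {
  sim  <- cosine_cols(mat[, ids == id, drop = FALSE])
  # Upper triangle only: each pair appears once and the diagonal is skipped
  keep <- which(upper.tri(sim), arr.ind = TRUE)
  data.frame(ID    = rep(id, nrow(keep)),
             col_a = rownames(sim)[keep[, "row"]],
             col_b = colnames(sim)[keep[, "col"]],
             Score = sim[keep],
             stringsAsFactors = FALSE)
}))

long
```

With three years per ID this gives three rows per ID. An ID with a single year contributes no rows, because it has no pair to compare.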
Data:
library(lsa) # assumption: cosine() below is taken to be lsa::cosine

# Column names: every combination of ID1..ID5 and 2001..2010, in random order
cols <- sample(c(paste("ID1", 2001:2010, sep = "_"), paste("ID2", 2001:2010, sep = "_"), paste("ID3", 2001:2010, sep = "_"), paste("ID4", 2001:2010, sep = "_"), paste("ID5", 2001:2010, sep = "_")))
rows <- paste("words", 1:100)
# A 100 x 50 matrix needs exactly 100 * 50 random values
mat <- matrix(data = runif(100 * 50), nrow = 100, ncol = 50, dimnames = list(rows, cols))
mat
# Function I will apply to each "ID"
cosine_dist_mat <- cosine(mat) # here it applies to the whole matrix; however, I wish to apply it to each of the "groups" or "IDs"
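Edit:
I applied the following to nest the data by "ID", but I am not sure how to extend this to thousands of IDs. With the data nested like this, though, I think I can apply the cosine function to each nest using mutate to get the results I am looking for. Any suggestions here would be great.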
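library(dplyr)  # %>%, group_by
library(tidyr)  # nest
library(tibble) # as_tibble, rownames_to_column

nested_results <- mat %>%
  as_tibble() %>%                            # drops the "words N" row names
  t() %>%                                    # one row per ID_year column
  data.frame() %>%                           # word columns become X1..X100
  rownames_to_column("Id_col") %>%
  group_by(grp = sub('_.*', '', Id_col)) %>% # ID is the part before "_"
  nest()                                     # remaining columns go into "data"
nested_results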
Output:
# A tibble: 5 x 2
grp data
<chr> <list>
1 ID3 <tibble [10 × 101]>
2 ID1 <tibble [10 × 101]>
3 ID5 <tibble [10 × 101]>
4 ID2 <tibble [10 × 101]>
5 ID4 <tibble [10 × 101]>
x <- nested_results$data[[3]]
# A tibble: 10 x 101
Id_col X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ID5_2… 0.274 0.839 0.660 0.866 0.238 0.861 0.451 0.359 0.924 0.594 0.241 0.735 0.427 0.688 0.803 0.645 0.870
2 ID5_2… 0.0916 0.585 0.548 0.679 0.192 0.841 0.170 0.197 0.910 0.895 0.978 0.263 0.841 0.335 0.344 0.204 0.170
3 ID5_2… 0.184 1.000 0.193 0.0191 0.464 0.585 0.822 0.110 0.489 0.0534 0.979 0.118 0.0223 0.342 0.893 0.368 0.992
4 ID5_2… 0.587 0.548 0.582 0.246 0.0497 0.891 0.583 0.558 0.532 0.520 0.0732 0.624 0.887 0.0541 0.681 0.0579 0.607
5 ID5_2… 0.580 0.533 0.0317 0.245 0.102 0.502 0.299 0.0151 0.0116 0.146 0.298 0.940 0.231 0.278 0.574 0.919 0.841
6 ID5_2… 0.275 0.470 0.980 0.958 0.366 0.837 0.889 0.853 0.620 0.219 0.414 0.00662 0.0317 0.927 0.658 0.120 0.270
7 ID5_2… 0.269 0.366 0.689 0.0648 0.323 0.208 0.400 0.798 0.872 0.679 0.550 0.843 0.378 0.917 0.577 0.421 0.882
8 ID5_2… 0.204 0.523 0.623 0.00725 0.0118 0.758 0.475 0.440 0.0494 0.727 0.783 0.439 0.963 0.133 0.788 0.236 0.462
9 ID5_2… 0.165 0.107 0.0527 0.544 0.398 0.977 0.0674 0.335 0.536 0.439 0.956 0.456 0.509 0.121 0.210 0.999 0.0239
10 ID5_2… 0.923 0.507 0.367 0.906 0.850 0.796 0.491 0.725 0.904 0.942 0.411 0.949 0.439 0.739 0.199 0.764 0.673
# … with 83 more variables: X18 <dbl>, X19 <dbl>, X20 <dbl>, X21 <dbl>, X22 <dbl>, X23 <dbl>, X24 <dbl>, X25 <dbl>, X26 <dbl>,
# X27 <dbl>, X28 <dbl>, X29 <dbl>, X30 <dbl>, X31 <dbl>, X32 <dbl>, X33 <dbl>, X34 <dbl>, X35 <dbl>, X36 <dbl>, X37 <dbl>,
# X38 <dbl>, X39 <dbl>, X40 <dbl>, X41 <dbl>, X42 <dbl>, X43 <dbl>, X44 <dbl>, X45 <dbl>, X46 <dbl>, X47 <dbl>, X48 <dbl>,
# X49 <dbl>, X50 <dbl>, X51 <dbl>, X52 <dbl>, X53 <dbl>, X54 <dbl>, X55 <dbl>, X56 <dbl>, X57 <dbl>, X58 <dbl>, X59 <dbl>,
# X60 <dbl>, X61 <dbl>, X62 <dbl>, X63 <dbl>, X64 <dbl>, X65 <dbl>, X66 <dbl>, X67 <dbl>, X68 <dbl>, X69 <dbl>, X70 <dbl>,
# X71 <dbl>, X72 <dbl>, X73 <dbl>, X74 <dbl>, X75 <dbl>, X76 <dbl>, X77 <dbl>, X78 <dbl>, X79 <dbl>, X80 <dbl>, X81 <dbl>,
# X82 <dbl>, X83 <dbl>, X84 <dbl>, X85 <dbl>, X86 <dbl>, X87 <dbl>, X88 <dbl>, X89 <dbl>, X90 <dbl>, X91 <dbl>, X92 <dbl>,
# X93 <dbl>, X94 <dbl>, X95 <dbl>, X96 <dbl>, X97 <dbl>, X98 <dbl>, X99 <dbl>, X100 <dbl>
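For the mutate idea from the edit, one possible sketch is to map a cosine helper over the nested data column with purrr::map(). It assumes dplyr, tidyr, tibble and purrr are installed. cosine_cols() and cosine_from_nest() are hypothetical helpers, the toy matrix is made up, and the nesting pipeline is a simplified variant of the one above rather than an exact copy.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)

# Toy data: 5 words x 6 columns (made-up values)
set.seed(42)
cols <- c("ID1_2001", "ID2_2001", "ID1_2002", "ID2_2002", "ID1_2003", "ID2_2003")
mat <- matrix(runif(5 * length(cols)), nrow = 5,
              dimnames = list(paste("words", 1:5), cols))

# Hypothetical helper: cosine similarity between every pair of columns of m
cosine_cols <- function(m) {
  norms <- sqrt(colSums(m^2))
  crossprod(m) / outer(norms, norms)
}

# Hypothetical helper: turn one nested tibble (one row per ID_year)
# back into a words x years matrix and compute its cosine similarities
cosine_from_nest <- function(d) {
  m <- t(as.matrix(select(d, -Id_col)))
  colnames(m) <- d$Id_col
  cosine_cols(m)
}

nested_results <- as.data.frame(t(mat)) %>%
  rownames_to_column("Id_col") %>%
  mutate(grp = sub("_.*", "", Id_col)) %>%
  group_by(grp) %>%
  nest() %>%
  ungroup() %>%
  mutate(cos_sim = map(data, cosine_from_nest)) # one matrix per ID

nested_results$cos_sim[[1]]
```

Since map() is called once per row of the nested tibble, nothing has to be written out per ID, so the same pipeline should carry over to thousands of IDs, with memory and run time being the main things to watch.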