Question

I have a matrix that looks like this:

             [,1]         [,2]        [,3]        [,4]       [,5]        [,6]        [,7]        [,8]         [,9]       [,10]
the    0.07200378  0.173467875 -0.32174805 -0.17641919 -0.1895841  0.41491635  0.52559372  0.46668538 -0.622698039  0.07609943
dog   -0.03110763 -0.307907604 -0.51872045 -0.61390705  0.2901446 -0.30045110  0.37480375  0.43265162 -0.095877141  0.13267635
went  -0.10276563  0.006781152 -0.22007612  0.29408635 -0.5130759 -0.54109880  0.27203657 -0.10996491 -0.442054480  0.14811820
to    -0.25024018 -0.690325871  0.04050764  0.19626275  0.1937401  0.22256489 -0.28244329  0.01593702  0.357230552  0.56581933
play  -0.30871394 -0.093627274 -0.28149478  0.09634858 -0.0895794  0.40877385 -0.60633919  0.15760252  0.001222108  0.82736039
with  -0.15535758  0.103512824 -0.22533448  0.18746118  0.4194084  0.64124607 -0.03984496  0.16687895 -0.373183180 -0.58537456
his    0.56851056 -0.376888059  0.48226617  0.06921187  0.5648746 -0.20768129 -0.28356166  0.70855895  0.031217873  0.71860737
owner -0.29910484 -0.727676094 -0.29929429 -0.23175114  0.4336813  0.39667153 -0.29670753 -0.04054499 -0.041433528  0.34875186
fun    0.08032176 -0.431446284  0.15740608  0.16003107 -0.1894946  0.37010769  0.26229681 -0.22716813 -0.310652746  0.06291729
john   0.08629179  0.470551208  0.31550134  0.61767611  0.6179546 -0.01474994  0.58974983 -0.39419778 -0.689627200 -0.18293759

The row names are words, and each row holds that word's vector from a word2vec model. I also have a second matrix, which looks like this:

                 dog play owner went fun NA_TEXT NA_2TEXT
1750_10-K_2005     0    1     0    0   1       0        0
1800_10-K_2005     1    0     1    0   0       0        1
1923_10-K_2005     1    0     0    0   0       0        0
2135_10-K_2005     0    0     0    0   0       1        0
2488_10-K_2005     0    0     0    0   0       0        0
2491_10-K_2005     0    0     1    0   0       0        1
2969_10-K_2005     1    1     0    1   0       0        1
3133_10-K_2005     0    0     0    0   0       0        0
3197_10-K/A_2005   0    0     0    1   0       1        0
3197_10-K_2005     0    0     0    0   0       0        0

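This is a bag-of-words matrix, where the row names are documents. I want to compute the cosine similarity between documents based on the word embeddings in the first matrix. I can average each word's embedding by running rowMeans(wrds), which gives: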

>   rowMeans(wrds)
         the          dog         went           to         play         with 
 0.041831714 -0.063769466 -0.120801359  0.036905292  0.011155287  0.013941266 
         his        owner          fun         john 
 0.227511640 -0.075740768 -0.006568109  0.141621237

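Now I want to "join" these values onto the docs matrix by its colnames: wherever a word is present in a document, the 1 should be replaced by that word's mean.

Expected output (for the first few columns of the docs matrix):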

                     dog     play    owner     went
1750_10-K_2005     0        0.011    0        0
1800_10-K_2005    -0.063    0       -0.075    0
1923_10-K_2005    -0.063    0        0        0
2135_10-K_2005     0        0        0        0
2488_10-K_2005     0        0        0        0
2491_10-K_2005     0        0       -0.075    0
2969_10-K_2005    -0.063    0.011    0       -0.121
3133_10-K_2005     0        0        0        0
3197_10-K/A_2005   0        0        0       -0.121
3197_10-K_2005     0        0        0        0
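
One way the "join" described above could be sketched in base R is to scale each column of the 0/1 matrix by the mean of the matching word. The helper name join_word_means and the toy matrices below are hypothetical and only stand in for wrds and docs. The sketch assumes that columns with no embedding (such as NA_TEXT) should simply be dropped.

```r
# Hypothetical helper (not from the original post): replace each 1 in a
# document-term matrix with the mean of that word's embedding.
join_word_means <- function(docs, wrds) {
  word_means <- rowMeans(wrds)  # one number per word, named by word
  # Keep only columns that have an embedding; this drops NA_TEXT and similar.
  shared <- intersect(colnames(docs), names(word_means))
  # Multiply every shared column by its word's mean; zeros stay zero.
  sweep(docs[, shared, drop = FALSE], 2, word_means[shared], `*`)
}

# Toy stand-in for the question's data: 3 words with 2-dimensional embeddings.
# Row means are dog = 0.3, play = -0.2, went = 0.2.
toy_wrds <- matrix(c(0.2, -0.4, 0.1,
                     0.4,  0.0, 0.3),
                   nrow = 3,
                   dimnames = list(c("dog", "play", "went"), NULL))

# Two documents; doc_a contains dog and went, doc_b contains play and went.
toy_docs <- rbind(doc_a = c(dog = 1, play = 0, went = 1, NA_TEXT = 0),
                  doc_b = c(dog = 0, play = 1, went = 1, NA_TEXT = 1))

joined <- join_word_means(toy_docs, toy_wrds)
print(joined)
```

With the real data this would be called as join_word_means(docs, wrds). Note that the binary entries are multiplied by the mean, so a count greater than 1 would scale the mean rather than just mark presence.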

Data:

wrds <- structure(c(0.0720037762075663, -0.031107634305954, -0.102765634655952, 
-0.250240176916122, -0.30871394276619, -0.155357576906681, 0.568510562181473, 
-0.299104837700725, 0.0803217552602291, 0.0862917900085449, 0.173467874526978, 
-0.307907603681087, 0.00678115151822567, -0.690325871109962, 
-0.0936272740364075, 0.103512823581696, -0.376888059079647, -0.727676093578339, 
-0.43144628405571, 0.470551207661629, -0.321748048067093, -0.51872044801712, 
-0.220076121389866, 0.0405076444149017, -0.281494781374931, -0.225334476679564, 
0.482266165316105, -0.299294285476208, 0.157406084239483, 0.315501344390213, 
-0.17641919106245, -0.613907054066658, 0.294086349196732, 0.196262747049332, 
0.0963485836982727, 0.18746118247509, 0.0692118704319, -0.231751143932343, 
0.16003106534481, 0.617676109075546, -0.189584106206894, 0.290144592523575, 
-0.513075917959213, 0.193740077316761, -0.0895793968811631, 0.419408403337002, 
0.564874619245529, 0.433681339025497, -0.189494623802602, 0.617954611778259, 
0.414916351437569, -0.300451099872589, -0.541098803281784, 0.222564890980721, 
0.408773854374886, 0.641246065497398, -0.207681285217404, 0.396671526134014, 
0.370107688009739, -0.0147499442100525, 0.525593716651201, 0.374803751707077, 
0.272036574780941, -0.282443292438984, -0.606339186429977, -0.0398449599742889, 
-0.283561661839485, -0.296707525849342, 0.262296808883548, 0.589749827980995, 
0.466685384511948, 0.432651624083519, -0.109964912757277, 0.015937015414238, 
0.157602518796921, 0.166878946125507, 0.708558946847916, -0.0405449904501438, 
-0.227168127894402, -0.394197784364223, -0.622698038816452, -0.0958771407604218, 
-0.442054480314255, 0.357230551540852, 0.00122210755944252, -0.37318317964673, 
0.0312178730964661, -0.0414335280656815, -0.310652745887637, 
-0.689627200365067, 0.0760994255542755, 0.132676348090172, 0.148118201643229, 
0.565819330513477, 0.827360391616821, -0.585374563932419, 0.718607366085052, 
0.348751857876778, 0.0629172921180725, -0.18293759226799), .Dim = c(10L, 
10L), .Dimnames = list(c("the", "dog", "went", "to", "play", 
"with", "his", "owner", "fun", "john"), NULL))

docs <- structure(c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 
0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0), .Dim = c(10L, 7L), .Dimnames = list(
    c("1750_10-K_2005", "1800_10-K_2005", "1923_10-K_2005", "2135_10-K_2005", 
    "2488_10-K_2005", "2491_10-K_2005", "2969_10-K_2005", "3133_10-K_2005", 
    "3197_10-K/A_2005", "3197_10-K_2005"), c("dog", "play", "owner", 
    "went", "fun", "NA_TEXT", "NA_2TEXT")))

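For the cosine similarity between documents that the question is working toward, a minimal base R sketch follows. The helper name cosine_rows and the toy matrix are hypothetical. The sketch assumes that a document with no matching words (an all-zero row, as for 2488_10-K_2005) should get NA rather than a similarity value, since a zero vector has no direction.

```r
# Hypothetical helper (not from the original post): pairwise cosine
# similarity between the rows of a numeric matrix.
cosine_rows <- function(m) {
  norms <- sqrt(rowSums(m^2))    # Euclidean length of each row
  norms[norms == 0] <- NA_real_  # all-zero rows have no direction
  # Dot product of every pair of rows, divided by the product of their norms.
  tcrossprod(m) / outer(norms, norms)
}

# Toy stand-in for a joined document matrix; doc_c contains no known words.
toy_joined <- rbind(doc_a = c(dog = 0.3, play =  0.0, went = 0.2),
                    doc_b = c(dog = 0.0, play = -0.2, went = 0.2),
                    doc_c = c(dog = 0.0, play =  0.0, went = 0.0))

sim <- cosine_rows(toy_joined)
print(round(sim, 3))
```

The result is a symmetric documents-by-documents matrix with 1 on the diagonal for every non-empty document.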
Fill a matrix based on row names

0 answers: