I have a matrix that looks like this:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
the 0.07200378 0.173467875 -0.32174805 -0.17641919 -0.1895841 0.41491635 0.52559372 0.46668538 -0.622698039 0.07609943
dog -0.03110763 -0.307907604 -0.51872045 -0.61390705 0.2901446 -0.30045110 0.37480375 0.43265162 -0.095877141 0.13267635
went -0.10276563 0.006781152 -0.22007612 0.29408635 -0.5130759 -0.54109880 0.27203657 -0.10996491 -0.442054480 0.14811820
to -0.25024018 -0.690325871 0.04050764 0.19626275 0.1937401 0.22256489 -0.28244329 0.01593702 0.357230552 0.56581933
play -0.30871394 -0.093627274 -0.28149478 0.09634858 -0.0895794 0.40877385 -0.60633919 0.15760252 0.001222108 0.82736039
with -0.15535758 0.103512824 -0.22533448 0.18746118 0.4194084 0.64124607 -0.03984496 0.16687895 -0.373183180 -0.58537456
his 0.56851056 -0.376888059 0.48226617 0.06921187 0.5648746 -0.20768129 -0.28356166 0.70855895 0.031217873 0.71860737
owner -0.29910484 -0.727676094 -0.29929429 -0.23175114 0.4336813 0.39667153 -0.29670753 -0.04054499 -0.041433528 0.34875186
fun 0.08032176 -0.431446284 0.15740608 0.16003107 -0.1894946 0.37010769 0.26229681 -0.22716813 -0.310652746 0.06291729
john 0.08629179 0.470551208 0.31550134 0.61767611 0.6179546 -0.01474994 0.58974983 -0.39419778 -0.689627200 -0.18293759
The row names are words, and each row holds that word's vector from a word2vec model. I also have a second matrix that looks like this:
                 dog play owner went fun NA_TEXT NA_2TEXT
1750_10-K_2005     0    1     0    0   1       0        0
1800_10-K_2005     1    0     1    0   0       0        1
1923_10-K_2005     1    0     0    0   0       0        0
2135_10-K_2005     0    0     0    0   0       1        0
2488_10-K_2005     0    0     0    0   0       0        0
2491_10-K_2005     0    0     1    0   0       0        1
2969_10-K_2005     1    1     0    1   0       0        1
3133_10-K_2005     0    0     0    0   0       0        0
3197_10-K/A_2005   0    0     0    1   0       1        0
3197_10-K_2005     0    0     0    0   0       0        0
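This is a bag-of-words matrix whose row names are documents. I want to compute the cosine similarity between documents based on the word embeddings in the first matrix. I can average each word's embedding vector with rowMeans(wrds), which gives: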
> rowMeans(wrds)
         the          dog         went           to         play         with
 0.041831714 -0.063769466 -0.120801359  0.036905292  0.011155287  0.013941266
         his        owner          fun         john
 0.227511640 -0.075740768 -0.006568109  0.141621237
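For the cosine-similarity goal stated above, one possible route keeps the full embedding vectors instead of collapsing each word to a single number with rowMeans, so it is not the approach taken in this post. The sketch below assumes each document can be represented as the sum of the vectors of the words it contains; doc_cosine and the toy matrices are hypothetical names and made-up values used only for illustration.

```r
# Hypothetical helper (not from the post): cosine similarity between
# documents, with each document represented by the sum of the embedding
# vectors of the words it contains.
doc_cosine <- function(docs, wrds) {
  # Only words that have an embedding row can contribute
  shared <- intersect(colnames(docs), rownames(wrds))
  doc_vecs <- docs[, shared, drop = FALSE] %*% wrds[shared, , drop = FALSE]
  norms <- sqrt(rowSums(doc_vecs^2))
  # Documents with no embedded words have zero norm, so their entries are NaN
  tcrossprod(doc_vecs) / outer(norms, norms)
}

# Toy input, made up for illustration only
toy_wrds <- matrix(c(1, 0,
                     0, 1),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("dog", "play"), NULL))
toy_docs <- matrix(c(1, 0, 0,
                     1, 1, 0,
                     0, 0, 1),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("doc_a", "doc_b", "doc_c"),
                                   c("dog", "play", "NA_TEXT")))
toy_sim <- doc_cosine(toy_docs, toy_wrds)
```

Because cosine similarity ignores vector length, summing or averaging the word vectors per document gives the same result. Empty documents (all zeros, or only words such as NA_TEXT with no embedding) come out as NaN and would need separate handling.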
Now, wherever a word is present in docs, I want to "join" these per-word averages to the docs matrix by matching its colnames.
Expected output (for the first few columns of the docs matrix):
                    dog   play  owner   went
1750_10-K_2005        0  0.011      0      0
1800_10-K_2005   -0.063      0 -0.075      0
1923_10-K_2005   -0.063      0      0      0
2135_10-K_2005        0      0      0      0
2488_10-K_2005        0      0      0      0
2491_10-K_2005        0      0 -0.075      0
2969_10-K_2005   -0.063  0.011      0 -0.121
3133_10-K_2005        0      0      0      0
3197_10-K/A_2005      0      0      0 -0.121
3197_10-K_2005        0      0      0      0
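One way the join described above might be written is to scale each column of the 0/1 matrix by the mean of the matching word's vector. This is only a sketch under that reading of the question; join_word_means and the toy matrices are hypothetical names and made-up values, and the sketch assumes columns with no embedding row should simply be dropped.

```r
# Hypothetical helper (not from the post): multiply each column of a 0/1
# document-term matrix by the mean embedding value of that column's word.
join_word_means <- function(docs, wrds) {
  word_means <- rowMeans(wrds)  # one scalar per word, named by row name
  # Assumption: columns without an embedding (e.g. NA_TEXT) are dropped
  shared <- intersect(colnames(docs), names(word_means))
  sweep(docs[, shared, drop = FALSE], 2, word_means[shared], `*`)
}

# Toy input, made up for illustration only
toy_wrds <- matrix(c(0.2, -0.4,
                     0.6,  0.0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("dog", "play"), NULL))
toy_docs <- matrix(c(1, 0, 0,
                     1, 1, 0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("doc_a", "doc_b"),
                                   c("dog", "play", "NA_TEXT")))
toy_joined <- join_word_means(toy_docs, toy_wrds)
```

With the objects from this post, the call would be join_word_means(docs, wrds); under the dropping assumption it would keep the dog, play, owner, went, and fun columns, since NA_TEXT and NA_2TEXT have no row in wrds.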
Data:
wrds <- structure(c(0.0720037762075663, -0.031107634305954, -0.102765634655952,
-0.250240176916122, -0.30871394276619, -0.155357576906681, 0.568510562181473,
-0.299104837700725, 0.0803217552602291, 0.0862917900085449, 0.173467874526978,
-0.307907603681087, 0.00678115151822567, -0.690325871109962,
-0.0936272740364075, 0.103512823581696, -0.376888059079647, -0.727676093578339,
-0.43144628405571, 0.470551207661629, -0.321748048067093, -0.51872044801712,
-0.220076121389866, 0.0405076444149017, -0.281494781374931, -0.225334476679564,
0.482266165316105, -0.299294285476208, 0.157406084239483, 0.315501344390213,
-0.17641919106245, -0.613907054066658, 0.294086349196732, 0.196262747049332,
0.0963485836982727, 0.18746118247509, 0.0692118704319, -0.231751143932343,
0.16003106534481, 0.617676109075546, -0.189584106206894, 0.290144592523575,
-0.513075917959213, 0.193740077316761, -0.0895793968811631, 0.419408403337002,
0.564874619245529, 0.433681339025497, -0.189494623802602, 0.617954611778259,
0.414916351437569, -0.300451099872589, -0.541098803281784, 0.222564890980721,
0.408773854374886, 0.641246065497398, -0.207681285217404, 0.396671526134014,
0.370107688009739, -0.0147499442100525, 0.525593716651201, 0.374803751707077,
0.272036574780941, -0.282443292438984, -0.606339186429977, -0.0398449599742889,
-0.283561661839485, -0.296707525849342, 0.262296808883548, 0.589749827980995,
0.466685384511948, 0.432651624083519, -0.109964912757277, 0.015937015414238,
0.157602518796921, 0.166878946125507, 0.708558946847916, -0.0405449904501438,
-0.227168127894402, -0.394197784364223, -0.622698038816452, -0.0958771407604218,
-0.442054480314255, 0.357230551540852, 0.00122210755944252, -0.37318317964673,
0.0312178730964661, -0.0414335280656815, -0.310652745887637,
-0.689627200365067, 0.0760994255542755, 0.132676348090172, 0.148118201643229,
0.565819330513477, 0.827360391616821, -0.585374563932419, 0.718607366085052,
0.348751857876778, 0.0629172921180725, -0.18293759226799), .Dim = c(10L,
10L), .Dimnames = list(c("the", "dog", "went", "to", "play",
"with", "his", "owner", "fun", "john"), NULL))
docs <- structure(c(0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0), .Dim = c(10L, 7L), .Dimnames = list(
c("1750_10-K_2005", "1800_10-K_2005", "1923_10-K_2005", "2135_10-K_2005",
"2488_10-K_2005", "2491_10-K_2005", "2969_10-K_2005", "3133_10-K_2005",
"3197_10-K/A_2005", "3197_10-K_2005"), c("dog", "play", "owner",
"went", "fun", "NA_TEXT", "NA_2TEXT")))