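I have a sparse matrix and I want to compute the cosine similarity between its columns. The dataset is about 5,000 rows by 50,000 columns. The rows are words and the columns hold the scores assigned to each word.
I am using the following code: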
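# column indices of the sparse matrix
i1 <- seq_len(ncol(mat))
t1 <- Sys.time()
# call cosine() once for every ordered pair of columns (i, j)
cosine_dist_mat <- sapply(i1, function(i) sapply(i1, function(j) cosine(mat[, i], mat[, j])))
t2 <- Sys.time()
# elapsed time
t2 - t1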
I ran it on 10x10, 100x100 and 1000x1000 matrices and got the following times (in minutes):
plot(c(0.04010415, 3.540563, 12.94297))
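So the 10x10 case takes well under a minute, while the 1000x1000 case takes almost 13 minutes. The runtime therefore grows steeply with the size of the matrix.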
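A note on why the cost climbs so quickly: the nested sapply calls cosine() once per ordered pair of columns, so the number of calls is ncol(mat)^2 (2.5e9 calls for 50,000 columns). The same matrix of similarities can, in principle, be obtained from a single sparse cross product. The following is only a minimal sketch, assuming the Matrix package and assuming cosine() is the usual dot product divided by the product of the norms; `cosine_sim_cols` and `toy` are hypothetical names used for illustration, not part of any package.

```r
library(Matrix)

# Hypothetical helper: cosine similarity between every pair of columns
# of a sparse matrix, using one sparse cross product instead of a loop.
cosine_sim_cols <- function(m) {
  # Euclidean norm of each column
  norms <- sqrt(Matrix::colSums(m^2))
  # t(m) %*% m holds the dot product of every pair of columns
  dots <- as.matrix(Matrix::crossprod(m))
  # divide entry (i, j) by norms[i] * norms[j];
  # an all-zero column produces NaN in its row and column
  dots / outer(norms, norms)
}

# Small random sparse matrix for illustration only (not the real data)
set.seed(1)
toy <- rsparsematrix(nrow = 100, ncol = 10, density = 0.2)
sim <- cosine_sim_cols(toy)
```

Note that the dense result for 50,000 columns would hold 2.5e9 doubles, roughly 20 GB, so at full size the output may need to be computed and stored in blocks of columns rather than all at once.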
> head(mat)
6 x 10 sparse Matrix of class "dgCMatrix"
[[ suppressing 10 column names ‘20_2005’, ‘1750_2005’, ‘2034_2005’ ... ]]
“account . . . . . . . . . .
“amend . . . . . . . . . .
“anticipate” . . . . . . . . . .
“anticipates” . . . . . . . . . .
“asc . . . . . . . . . .
“asc” . . . . . . . . . .
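I have already been running the code on a single CPU for 2 days (I had not estimated beforehand how long the function would take).
I have now looked into the foreach package as a way to run the process in parallel.
Does anyone know how to incorporate parallel processing into the sapply call I have?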
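For concreteness, one possible shape for a parallel version is sketched below. It swaps in parSapply from the base parallel package rather than foreach, since parSapply is a near drop-in for the outer sapply. Everything here is an assumption for illustration: `cosine_vec` is a hypothetical stand-in for cosine(), `toy` is made-up data, and the worker count of 2 is arbitrary.

```r
library(Matrix)
library(parallel)

# Hypothetical stand-in for cosine(): similarity of two numeric vectors
cosine_vec <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Toy sparse matrix for illustration only (not the real data)
set.seed(1)
toy <- rsparsematrix(nrow = 100, ncol = 10, density = 0.2)
i1 <- seq_len(ncol(toy))

# Start worker processes; 2 is an assumed worker count
cl <- makeCluster(2)
# Each worker must load Matrix to be able to index a dgCMatrix
invisible(clusterEvalQ(cl, library(Matrix)))

# The outer loop over columns is spread across the workers;
# the inner loop stays sequential inside each worker.
# The matrix, indices and function are passed explicitly as arguments.
cosine_par <- parSapply(
  cl, i1,
  function(i, m, cols, cos_fun) {
    sapply(cols, function(j) cos_fun(m[, i], m[, j]))
  },
  m = toy, cols = i1, cos_fun = cosine_vec
)

stopCluster(cl)
```

This only divides the same ncol(mat)^2 calls among the workers, so the best case is a speedup of about the number of workers, and the full matrix is copied to every worker.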
Data:
mat <- new("dgCMatrix", i = c(47L, 53L, 55L, 69L, 71L, 76L, 84L, 87L,
90L, 97L, 47L, 49L, 50L, 52L, 56L, 61L, 62L, 63L, 69L, 71L, 76L,
79L, 81L, 84L, 87L, 96L, 97L, 99L, 50L, 61L, 62L, 67L, 69L, 71L,
76L, 77L, 81L, 84L, 87L, 96L, 99L, 48L, 49L, 50L, 55L, 59L, 61L,
62L, 63L, 66L, 68L, 69L, 71L, 76L, 81L, 84L, 87L, 97L, 99L, 47L,
49L, 50L, 51L, 53L, 56L, 61L, 62L, 63L, 69L, 71L, 77L, 78L, 81L,
96L, 97L, 99L, 49L, 56L, 61L, 62L, 69L, 71L, 75L, 78L, 81L, 84L,
87L, 99L, 46L, 49L, 62L, 63L, 66L, 67L, 69L, 71L, 76L, 77L, 78L,
81L, 84L, 87L, 96L, 97L, 99L, 49L, 50L, 51L, 53L, 61L, 62L, 69L,
71L, 75L, 76L, 77L, 78L, 84L, 87L, 96L, 97L, 99L, 48L, 49L, 50L,
51L, 56L, 57L, 62L, 63L, 66L, 67L, 69L, 71L, 77L, 79L, 81L, 84L,
87L, 96L, 97L, 99L, 49L, 50L, 51L, 52L, 55L, 61L, 62L, 63L, 68L,
69L, 71L, 76L, 79L, 81L, 84L, 87L, 90L, 96L, 97L, 99L), p = c(0L,
10L, 28L, 41L, 59L, 76L, 88L, 105L, 122L, 142L, 162L), Dim = c(100L,
10L), Dimnames = list(Terms = c("“account", "“amend", "“anticipate”",
"“anticipates”", "“asc", "“asc”", "“asu", "“asu”",
"“believe”", "“believes”", "“busi", "“business”",
"“cautionari", "“company”", "“continue”", "“credit",
"“critic", "“disclosur", "“estimate”", "“estimates”",
"“expect”", "“expects”", "“fair", "“fasb”", "“forwardlook",
"“gaap”", "“incom", "“intend”", "“intends”", "“liquid",
"“note", "“plan”", "“plans”", "“potential”", "“project”",
"“result", "“risk", "“sec”", "“secur", "“select",
"“sfas", "“special", "“summari", "“well", "“will”",
"•chang", "aaa", "abandon", "abat", "abil", "abl", "abnorm",
"abroad", "absenc", "absent", "absolut", "absorb", "absorpt",
"abstract", "abus", "academ", "acceler", "accept", "access",
"accessori", "accid", "accommod", "accompani", "accomplish",
"accord", "accordion", "account", "accounting", "accounts", "accredit",
"accret", "accru", "accrual", "accumul", "accur", "accuraci",
"achiev", "acid", "acknowledg", "acquir", "acquire", "acquired",
"acquisit", "acquisition", "acquisitiond", "acquisitionrel",
"acquisitions", "acquisitionsu", "acquisitionu", "acr", "acreag",
"across", "act", "act”", "action"), Docs = c("20_2005", "1750_2005",
"2034_2005", "2062_2005", "2488_2005", "2969_2005", "3133_2005",
"3327_2005", "3333_2005", "3453_2005")), x = c(0.00113980515407692,
0.00682355899898636, 0.00347759367109875, 5.20001257200727e-05,
2.47397153291907e-05, 0.000108319778164461, 0.000396999848727827,
0.000603493824599814, 0.00398763664820086, 0.000273465937531601,
0.000419330111519823, 0.000528449298979236, 0.000920819686932983,
0.000666064278916234, 0.000390540724451623, 0.000336326269937498,
0.000140334159251127, 0.000600340625202571, 0.000133914580991972,
1.90307227407633e-05, 3.98504468022793e-05, 0.00116492829619315,
0.00092504605067629, 0.000876328679046271, 0.000602634269230779,
0.000416814888082897, 0.000503035548101499, 0.00126018557913386,
0.000156338718399063, 0.000742328431697948, 4.4248714784307e-05,
0.000355437905498253, 0.000126673679410513, 1.18706939075604e-05,
8.79566133287654e-05, 0.000313662207759417, 0.000157056279394439,
0.000161183685832125, 0.000280024170106673, 0.000459990149203084,
0.000309048969614686, 0.00504421844906945, 0.000381235929496257,
0.000547501060433858, 0.00384577193682792, 0.00114672352310616,
0.000649911947127846, 0.000335746257334945, 0.000193348235300333,
0.000611910514753471, 0.000569469982715089, 7.39355915972189e-05,
6.92856464586234e-06, 5.13376122933583e-05, 9.16689953676532e-05,
0.001223014488113, 0.00114409141206702, 0.000388823403657316,
0.000360765054071308, 0.000626141368137108, 0.000220941707332476,
0.0006345982091375, 0.000864881880996905, 0.000535494102642698,
0.00116630643401737, 0.00100440099584482, 8.98055051439052e-05,
0.000448212625430376, 0.000142828928897222, 2.53278109170883e-05,
0.000424397908499662, 0.000822198587001557, 0.00031875543901768,
0.000622385650623143, 0.000901355827278763, 0.00188170203128431,
7.06152424069764e-05, 0.00130467236782308, 0.000374519705191304,
6.69731142246855e-05, 0.000106515721439292, 1.98097861862727e-05,
0.00765000679742822, 0.000613160627451439, 0.000316952275815201,
0.000243961286898647, 0.000423833569441392, 0.000155921454773087,
0.000439822416240934, 0.000362718010375718, 2.75208058908751e-05,
0.000618094368395823, 0.000326025252263801, 0.000110533578784822,
2.62618681659884e-05, 2.1297298021434e-05, 2.73526236190051e-05,
0.000292626693581789, 9.44857208935703e-05, 0.000537252543972119,
0.000225561039284774, 0.000957896738312893, 0.000143047088143026,
0.000483384262049943, 0.00124939896768756, 0.000240145147451988,
0.000241414302537583, 0.0006580379618847, 0.000407426095570545,
0.000382094941946792, 2.27759161373593e-05, 4.34680662572646e-05,
7.0501638013349e-06, 0.0003716542830774, 4.52734606794179e-05,
0.000161449754511765, 0.00015639068592562, 0.000414826298245946,
0.000504473964058679, 0.000710305176997578, 0.0012572795603698,
0.000318150411761914, 0.00105877105901901, 0.000228630388793112,
0.000574596722388701, 0.000783106991035519, 0.000528015872592884,
0.00316437784917878, 0.000162628725278892, 0.000202916981010363,
0.000642193781129678, 0.000217725395222404, 5.17297611149423e-05,
1.78989716073174e-05, 0.000192135467529887, 0.000787498706679388,
0.000577233997404633, 0.000493669655913719, 0.000514590921495855,
0.00112707773126517, 0.00108817641693684, 0.000567928811290766,
0.000525119733117927, 0.00145626197138337, 0.000496177965403544,
0.000570575561526461, 0.000365325543047343, 0.000144054828039969,
0.000120215487079327, 0.00102854844547273, 0.000378673914811766,
9.83282025878094e-05, 1.38216242020314e-05, 0.000102412147510543,
0.000498960564151545, 0.000487648633239346, 0.000187673976805267,
0.0002445342839795, 0.000418906192545259, 0.000178529607688508,
0.000775654300853724, 0.000959575180423929), factors = list())
Edit:
I have set up an AWS t2.large instance with 4 CPUs. (I can always add more CPUs later.)