I have a sample data.table, as shown below:
library(data.table)
dt = data.table(a = c("A","A","A","A","B","B"),
                b = c("D","D","D","D","E","E"),
                c = c("My name is ABC", "I am going to school", "name is Bond",
                      "My school is XYZ", "My name is ABC set 2",
                      "My name is ABC set 1"))
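Now I need to find the cosine similarity between each row and every other row of column "c", within the groups defined by column "a" and column "b", and put the text with the highest cosine value into a new column "d", as shown below.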
dt2 = data.table(a = c("A","A","A","A","B","B"),
                 b = c("D","D","D","D","E","E"),
                 c = c("My name is ABC", "I am going to school", "name is Bond",
                       "My school is XYZ", "My name is ABC set 2",
                       "My name is ABC set 1"),
                 d = c("name is Bond", "I am going to school", "My name is ABC",
                       "My school is XYZ", "My name is ABC set 1",
                       "My name is ABC set 2"))
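Below is the cosine function, which returns the similarity between two character vectors. The code is commented out because it creates temporary files.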
#library(lsa)
#cosine = function(x,y){
#td = tempfile()
#dir.create(td)
#f1 <- unlist(strsplit( as.character(x), split = " "))
#f1 = f1[grepl("[[:alnum:]]",f1 )]
#f2 <- unlist(strsplit( as.character(y), split = " "))
#f2 = f2[grepl("[[:alnum:]]",f2 )]
#write( c(f1), file=paste(td, "D1", sep="/"))
#write( c(f2), file=paste(td, "D2", sep="/"))
#myMatrix = textmatrix(td, minWordLength=1)
#unlink(td, recursive=TRUE)
#res <- lsa::cosine(myMatrix[,1], myMatrix[,2])
#return(res)
#}
I think it should look something like this, but I have no idea how to implement it:
testm[, lapply(.SD,MATCH:= cosine(x,y)),
by= .(ColumnA,ColumnB), .SDcols = c ("DESCRIPTION")]
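For reference, here is a minimal sketch of one way the grouped lookup could be written against the `dt` example above. It is an assumption-laden illustration, not the `lsa` approach: `cosine_text` and `best_match` are hypothetical helper names, and the similarity is a plain lowercase bag-of-words cosine computed in base R, so no temporary files are needed. Each row is compared only with the other rows of its group.

```r
library(data.table)

# Hypothetical helper (not from lsa): bag-of-words cosine between two strings.
# Assumption: tokens are lowercased runs of alphanumeric characters.
cosine_text <- function(x, y) {
  tx <- tolower(unlist(strsplit(as.character(x), "[^[:alnum:]]+")))
  ty <- tolower(unlist(strsplit(as.character(y), "[^[:alnum:]]+")))
  tx <- tx[nzchar(tx)]
  ty <- ty[nzchar(ty)]
  vocab <- union(tx, ty)
  # Term-frequency vectors over the shared vocabulary
  vx <- as.numeric(table(factor(tx, levels = vocab)))
  vy <- as.numeric(table(factor(ty, levels = vocab)))
  denom <- sqrt(sum(vx^2)) * sqrt(sum(vy^2))
  if (denom == 0) return(0)  # at least one text has no tokens
  sum(vx * vy) / denom
}

# Hypothetical helper: for every text in a group, return the OTHER text
# in that group with the highest cosine similarity (first one on ties).
best_match <- function(txt) {
  n <- length(txt)
  if (n < 2L) return(rep(NA_character_, n))  # nothing to compare against
  vapply(seq_len(n), function(i) {
    others <- txt[-i]
    sims <- vapply(others, function(o) cosine_text(txt[i], o), numeric(1))
    others[which.max(sims)]
  }, character(1))
}

dt <- data.table(a = c("A","A","A","A","B","B"),
                 b = c("D","D","D","D","E","E"),
                 c = c("My name is ABC", "I am going to school", "name is Bond",
                       "My school is XYZ", "My name is ABC set 2",
                       "My name is ABC set 1"))

# Add column d by reference, one call to best_match per (a, b) group
dt[, d := best_match(c), by = .(a, b)]
```

Under this plain definition, rows 2 and 4 are matched to "My school is XYZ" and "My name is ABC" respectively, which differs from the `d` column in `dt2`, where those rows keep their own text. The intended rule there (for example a minimum-similarity threshold) is not stated, so it would need to be added to `best_match` if required.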