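I am new to the LSH methods in the textreuse package and have found them very useful, especially for very large datasets.
Because my dataset really is very large, a naive pairwise comparison is not an option; it kills my R session.
However, I would like to know whether there is a simpler way to use textreuse for record linkage when the reference data is already known.
Example: I have a list of text messages that I want to link to reference terms. The messages are: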
**message**
1 this is apple
2 this apple is delicious
3 this pineapple looks good
4 can i find stawberry
5 I like to eat chocolate
6 food is good
I want to find the "links", that is, the matches against a supplied list of terms:
7 apple
8 chocolate
9 food
So the ideal result would be:
message --------- search_term
1 this is apple --------- 7 apple
2 this apple is delicious --------- 7 apple
3 this pineapple looks good --------- NA
4 can i find stawberry --------- NA
5 I like to eat chocolate --------- 8 chocolate
6 food is good --------- 9 food
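For reference, the target table above can be reproduced on the toy data with plain whole-word lookup in base R. This is only a sketch of the desired output, not an LSH approach and not part of textreuse; `match_term` and `linked` are hypothetical names, and it assumes that keeping the first matching term per message is acceptable.

```r
# Toy data copied from the example above
messages <- c("this is apple", "this apple is delicious",
              "this pineapple looks good", "can i find stawberry",
              "I like to eat chocolate", "food is good")
terms <- c("apple", "chocolate", "food")

# Split each message into lower-case word tokens
tokens <- strsplit(tolower(messages), "[^a-z]+")

# For one token vector, return the first reference term found among the
# tokens, or NA when none is present. Whole-word matching means
# "pineapple" does not count as "apple".
match_term <- function(tok) {
  hit <- terms[terms %in% tok]
  if (length(hit) == 0) NA_character_ else hit[1]
}

linked <- data.frame(
  message = messages,
  search_term = vapply(tokens, match_term, character(1)),
  stringsAsFactors = FALSE
)
print(linked)
```

Exact word lookup like this cannot catch misspellings or near matches, which is where a similarity-based method would be needed.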
To do this, I tried the following code, without success:
    library(textreuse)
    library(dplyr)
    library(tibble)  # rowid_to_column()

    docu <- data.frame(message = c("this is apple", "this apple is delicious",
                                   "this pineapple looks good", "can i find stawberry",
                                   "I like to eat chocolate", "food is good"))
    search_docu <- data.frame(message = c("apple", "chocolate", "food"))

    # stack messages and search terms; rowid 1-6 = messages, 7-9 = search terms
    dat <- rbind(docu, search_docu) %>% rowid_to_column()

    # minhash function producing 240 hashes per document
    minhash <- minhash_generator(n = 240, seed = 02082018)

    # build the corpus using textreuse; document IDs are the row ids
    docs <- setNames(as.character(dat$message), dat$rowid)
    corpus <- TextReuseCorpus(text = docs,
                              tokenizer = tokenize_words,
                              progress = FALSE,
                              keep_tokens = TRUE,
                              minhash_func = minhash,
                              skip_short = FALSE)

    # 240 hashes split into 10 bands, i.e. 24 rows per band
    buckets <- lsh(corpus, bands = 10, progress = FALSE)

    # grab candidate pairs
    candidates <- lsh_candidates(buckets)

    # get Jaccard similarities only for candidates
    lsh_jaccard <- lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)

    # attach the original text for both sides of each candidate pair
    lsh_df <- lsh_jaccard %>%
      mutate(a = as.numeric(a), b = as.numeric(b)) %>%
      as.data.frame() %>%
      left_join(dat, by = c("a" = "rowid")) %>%
      left_join(dat, by = c("b" = "rowid"))
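Now I have two questions:
1. What is wrong with this code?
2. Even if the approach above is correct, it is not the most efficient one, because comparing "this is apple" with "this apple is delicious" serves no purpose: both are messages, and only message-to-search-term pairs matter. On a very large dataset that computation is very expensive.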
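On the first question, one possible factor (an assumption, not a confirmed diagnosis) is the banding. Under the usual LSH approximation, h hashes split into b bands give a similarity threshold of roughly (1/b)^(b/h), and a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^(h/b))^b. The base-R sketch below evaluates both for the settings in the question; `jaccard`, `lsh_thresh`, and `lsh_prob` are hypothetical helper names written for this illustration (textreuse documents similar helpers, `lsh_threshold()` and `lsh_probability()`, which are not used here).

```r
# Jaccard similarity on word tokens, base R only
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# "this is apple" vs "apple": one shared word out of three distinct words
s <- jaccard(c("this", "is", "apple"), "apple")

# Standard LSH banding approximations for h hashes split into b bands
# (rows per band r = h / b)
lsh_thresh <- function(h, b) (1 / b)^(b / h)         # approx. similarity threshold
lsh_prob <- function(h, b, s) 1 - (1 - s^(h / b))^b  # chance a pair becomes a candidate

h <- 240
for (b in c(10, 120, 240)) {
  cat(sprintf("bands = %3d  threshold = %.3f  P(candidate | s = %.2f) = %.3g\n",
              b, lsh_thresh(h, b), s, lsh_prob(h, b, s)))
}
```

With 240 hashes and 10 bands the threshold comes out near 0.91, while "this is apple" and "apple" share only one word in three, so such a pair is very unlikely to land in the same bucket; more bands lower the threshold at the cost of more candidates. For the second question, one option might be to filter the candidate pairs so that exactly one ID in each pair belongs to the search terms (rowid 7 to 9 in the toy data) before calling `lsh_compare`.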
Any suggestions? Many thanks!