Question

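I am new to the LSH methods in the textreuse package and find them very useful, especially for very large datasets.

Because my dataset really is very large, simple pairwise comparison is not an option: it kills my R session...

However, I am wondering whether there is a simpler way to do record linkage with the textreuse package when the reference data is already known.

Example: I have a list of text messages that I want to link to reference terms.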

   message
1  this is apple
2  this apple is delicious
3  this pineapple looks good
4  can i find stawberry
5  I like to eat chocolate
6  food is good

I want to "link" or match them against a provided list of terms:

7  apple
8  chocolate
9  food

So the ideal result would be:

   message                      search_term
1  this is apple                7  apple
2  this apple is delicious      7  apple
3  this pineapple looks good    NA
4  can i find stawberry         NA
5  I like to eat chocolate      8  chocolate
6  food is good                 9  food
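
To make the target output concrete, here is a minimal base R sketch that produces this table by exact whole-word matching. It is not LSH and does not use textreuse; the variable names (`messages`, `terms`, `match_term`) are made up for illustration, and it assumes each message should be linked to at most one term.

```r
# Hypothetical data mirroring the example above
messages <- c("this is apple", "this apple is delicious",
              "this pineapple looks good", "can i find stawberry",
              "I like to eat chocolate", "food is good")
terms <- c("apple", "chocolate", "food")

# Split each message into lower-case word tokens
tokens <- strsplit(tolower(messages), "[^a-z]+")

# For each message, take the first reference term that occurs as a whole word
match_term <- vapply(tokens, function(tok) {
  hit <- terms[terms %in% tok]
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))

result <- data.frame(message = messages, search_term = match_term)
print(result)
```

Because matching is done on whole tokens, "pineapple" does not count as a hit for "apple", which is what the table above asks for.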

To do this I tried the following code, without success:

library(textreuse)
library(dplyr)
library(tibble)   # for rowid_to_column()

# messages to be linked
docu <- data.frame(message = c("this is apple", "this apple is delicious",
                               "this pineapple looks good", "can i find stawberry",
                               "I like to eat chocolate", "food is good"),
                   stringsAsFactors = FALSE)

# reference terms
search_docu <- data.frame(message = c("apple", "chocolate", "food"),
                          stringsAsFactors = FALSE)

# stack both sets and give every row an id (1-6 are messages, 7-9 are terms)
dat <- rbind(docu, search_docu) %>% rowid_to_column()

minhash <- minhash_generator(n = 240, seed = 02082018)

# build the corpus using textreuse; documents are named by row id
docs <- setNames(dat$message, dat$rowid)
corpus <- TextReuseCorpus(text = docs,
                          tokenizer = tokenize_words,
                          progress = FALSE,
                          keep_tokens = TRUE,
                          minhash_func = minhash,
                          skip_short = FALSE)

buckets <- lsh(corpus, bands = 10, progress = FALSE)

# grab candidate pairs
candidates <- lsh_candidates(buckets)
# get Jaccard similarities only for candidates
lsh_jaccard <- lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)

# attach the original text to both sides of each candidate pair
lsh_df <- lsh_jaccard %>%
  mutate(a = as.numeric(a), b = as.numeric(b)) %>%
  as.data.frame() %>%
  left_join(dat, by = c("a" = "rowid")) %>%
  left_join(dat, by = c("b" = "rowid"))
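
One thing worth checking with these settings is the similarity level that 240 minhashes in 10 bands actually targets. The sketch below works through the standard banded-LSH arithmetic in base R; `candidate_prob` is a hypothetical helper name, and the formulas are the usual textbook approximation, not code taken from textreuse (which, as far as I know, offers `lsh_threshold()` and `lsh_probability()` for the same purpose).

```r
h <- 240          # number of minhashes, as in minhash_generator(n = 240)
b <- 10           # number of bands, as in lsh(bands = 10)
r <- h / b        # rows per band

# Approximate Jaccard similarity at which a pair becomes likely to be a candidate
threshold <- (1 / b)^(1 / r)

# Probability that a pair with Jaccard similarity s agrees on at least one band
candidate_prob <- function(s) 1 - (1 - s^r)^b

# "this is apple" vs "apple": 1 shared word out of 3 distinct words
s <- 1 / 3

print(threshold)
print(candidate_prob(s))
```

Under these assumptions the threshold comes out at roughly 0.91, while a message and its reference term in the example have a word-level Jaccard similarity of only about 0.33, so such a pair would almost never be returned as a candidate.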

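Now I have two questions:

1. What is wrong with this code?
2. Even if the approach above were correct, it is not the most efficient one, because comparing "this is apple" with "this apple is delicious" is pointless. On a very large dataset that computation is very expensive.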
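
To illustrate the second point, here is a minimal base R sketch of the kind of filtering I have in mind: keep only pairs where one side is a message and the other is a reference term. The `candidates` data frame below is made-up example data with the same `a`/`b` column layout as the pairs above, and `reference_ids` is a hypothetical name. Note that this would only reduce the comparison step, not the hashing itself.

```r
# Hypothetical candidate pairs (ids 1-6 are messages, 7-9 are reference terms)
candidates <- data.frame(a = c(1, 1, 2, 5, 7),
                         b = c(2, 7, 7, 8, 8))
reference_ids <- 7:9

# Keep a pair only when exactly one side is a reference term
is_ref_a <- candidates$a %in% reference_ids
is_ref_b <- candidates$b %in% reference_ids
cross_pairs <- candidates[xor(is_ref_a, is_ref_b), ]

print(cross_pairs)
```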

Any suggestions? Many thanks!
