How to use LSH for name matching in R?

Asked: 2019-03-28 15:22:08

Tags: r text text-search name-matching

I am new to the LSH methods in the textreuse package and have found them very useful, especially for very large datasets.

Because my datasets really are very large, a simple pairwise comparison is not feasible. Pairwise comparison kills my R session...
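To make the scaling concrete, here is a small arithmetic sketch in base R. The helper name `n_pairs` and the sizes used are illustrative assumptions, not figures from my real data.

```r
# Number of unordered pairs among n documents: n * (n - 1) / 2.
n_pairs <- function(n) n * (n - 1) / 2

# The toy example in this question has 6 messages + 3 terms = 9 documents.
all_pairs_toy <- n_pairs(9)

# Only message-vs-term pairs are actually of interest: 6 * 3.
cross_pairs_toy <- 6 * 3

# A hypothetical corpus of one million documents.
all_pairs_large <- n_pairs(1e6)

print(all_pairs_toy)
print(cross_pairs_toy)
print(format(all_pairs_large, big.mark = ","))
```

The all-pairs count grows quadratically with the number of documents, while the message-vs-term count grows only with the product of the two list sizes.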

However, I would like to know whether there is a simpler way to do record linkage with the textreuse package when the reference data is already known.

Example: I have a list of messages that I want to link to reference terms.

    **message**
1              this is apple
2       this apple is delicious
3  this pineapple looks good
4       can i find stawberry
5    I like to eat chocolate
6               food is good

I want to "link" them to, that is, match them against, the supplied list of terms:

7                   apple
8                 chocolate
9                     food

So the ideal result would be

   message                      ---------     search_term
1  this is apple                ---------     7    apple
2  this apple is delicious      ---------     7    apple
3  this pineapple looks good    ---------     NA
4  can i find stawberry         ---------     NA
5  I like to eat chocolate      ---------     8    chocolate
6  food is good                 ---------     9    food
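To pin down the expected output, the table above can be reproduced with a plain whole-word lookup in base R. This is not LSH and not the textreuse API; it is only a minimal sketch of the target result, and the names `messages`, `terms` and `first_match` are illustrative.

```r
messages <- c("this is apple", "this apple is delicious",
              "this pineapple looks good", "can i find stawberry",
              "I like to eat chocolate", "food is good")
terms <- c("apple", "chocolate", "food")

# Split each message into lower-case word tokens.
tokens <- strsplit(tolower(messages), "[^a-z]+")

# For each message, keep the first reference term that occurs as a whole word.
# "pineapple" is a different token from "apple", so message 3 stays NA.
first_match <- vapply(tokens, function(tok) {
  hit <- terms[terms %in% tok]
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))

result <- data.frame(message = messages,
                     search_term = first_match,
                     stringsAsFactors = FALSE)
print(result)
```

Whole-word matching is what separates "apple" from "pineapple" here; a substring search would link message 3 as well.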

To do this I tried the following code, without success:

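library(textreuse)
library(dplyr)
library(tibble)  # provides rowid_to_column()

# messages to be linked
docu <- data.frame(message = c("this is apple", "this apple is delicious",
                               "this pineapple looks good", "can i find stawberry",
                               "I like to eat chocolate", "food is good"),
                   stringsAsFactors = FALSE)

# reference terms
search_docu <- data.frame(message = c("apple", "chocolate", "food"),
                          stringsAsFactors = FALSE)

# stack messages (rowid 1-6) on top of the reference terms (rowid 7-9)
dat <- rbind(docu, search_docu) %>% rowid_to_column()
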
minhash <- minhash_generator(n = 240, seed = 02082018)
# build the corpus using textreuse
docs <- apply(dat, 1, function(x) paste(x[-1], collapse = " ")) 
names(docs)<- dat$rowid
corpus <- TextReuseCorpus(text = docs, 
                          tokenizer =  tokenize_words,
                          progress = FALSE, 
                          keep_tokens = TRUE, 
                          minhash_func = minhash, 
                          skip_short = FALSE
) 

# 240 minhashes split into 10 bands, i.e. 24 rows per band
buckets <- lsh(corpus, bands = 10, progress = FALSE)

# grab candidate pairs
candidates <- lsh_candidates(buckets)
# get Jaccard similarities only for candidates
lsh_jaccard <- lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)

# attach the original text for both sides of each candidate pair
lsh_df <- lsh_jaccard %>%
  mutate(a = as.numeric(a), b = as.numeric(b)) %>%
  as.data.frame() %>%
  left_join(dat, by = c("a" = "rowid")) %>%
  left_join(dat, by = c("b" = "rowid"))
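One thing that may explain the empty result is the banding itself. The sketch below applies the standard LSH banding approximation in base R; as far as I know textreuse exposes the same quantities as `lsh_threshold()` and `lsh_probability()`. The helper `p_candidate` and the alternative band count of 120 are assumptions chosen for illustration.

```r
h <- 240          # number of minhashes, as in minhash_generator(n = 240)
b <- 10           # number of bands, as in lsh(corpus, bands = 10)
r <- h / b        # rows per band

# Approximate Jaccard similarity at which a pair becomes likely to be a candidate.
threshold <- (1 / b)^(1 / r)

# Probability that a pair with Jaccard similarity s shares at least one band.
p_candidate <- function(s, h, b) 1 - (1 - s^(h / b))^b

# "this is apple" vs "apple": 1 shared word out of 3 distinct words.
s <- 1 / 3

p_10_bands  <- p_candidate(s, h, 10)
p_120_bands <- p_candidate(s, h, 120)

print(threshold)    # roughly 0.91
print(p_10_bands)   # effectively zero
print(p_120_bands)  # close to 1
```

With 10 bands, only pairs with a Jaccard similarity around 0.9 are likely to become candidates, while the pairs wanted here score between 0.2 and 0.33. More bands (fewer rows per band) lower that threshold, at the cost of more candidate pairs to score.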

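Now I have two questions:
1. What is wrong with this code?
2. Even if the approach above is correct, it is not the most efficient one, because comparing "this is apple" with "this apple is delicious" is pointless. On a very large dataset that computation is very expensive.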
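For the second point, one option might be to drop message-vs-message pairs before any similarity is computed. The sketch below uses a hand-written stand-in for the candidate table, with the `a` and `b` id columns that `lsh_candidates()` returns; the rows themselves are invented for illustration. textreuse also has `lsh_query(buckets, id)`, which looks up the candidates for a single document and could be called once per reference term instead.

```r
# Toy stand-in for a candidate table; ids are the rowid values as character.
candidates <- data.frame(a = c("1", "1", "2", "5", "6"),
                         b = c("2", "7", "7", "8", "9"),
                         stringsAsFactors = FALSE)

message_ids <- as.character(1:6)  # rowid 1-6 are messages
term_ids    <- as.character(7:9)  # rowid 7-9 are reference terms

# Keep only pairs that link a message to a reference term.
is_cross <- (candidates$a %in% message_ids & candidates$b %in% term_ids) |
            (candidates$a %in% term_ids & candidates$b %in% message_ids)
cross_pairs <- candidates[is_cross, ]

print(cross_pairs)
```

Only the remaining rows would then need to be scored, so the message-vs-message pair ("1", "2") is never compared.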

Any suggestions? Many thanks!

0 answers:

No answers