Question

I have 2 tables. Table 1 is the smaller one, with about 10K values. Table 1 (sample):

KeyWords                         PageView
Phillips Trimmer                123
Buy Samsung Mobile              45
Ripe yellow Banana              63
Pepsi                           140

Table 2 contains a million values.

Table 2 (sample):

KeyWords                         PageView
Electric Trimmer                123
Samsung Mobile                  45
Yellow Ripe Banana              63
Samsung S6                      304
Banana                          105
Phillips                        209
Trimmer Phillips                29

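Now I want to take every keyword from Table 1, look it up in Table 2, and find the best match. Word order should not have much influence on the match, i.e. "Ripe yellow Banana" should match "Yellow Ripe Banana" exactly. "Buy Samsung Mobile" should match "Samsung Mobile" and "Samsung S6".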
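One way to make the comparison independent of word order is to reduce each phrase to a normalised key: lower-case it, split it into words, sort the words, and paste them back together. The sketch below uses base R only; make_key, t1 and t2 are illustrative names, not part of the original data. It only covers the case where both phrases contain exactly the same words; partial matches such as "Buy Samsung Mobile" versus "Samsung Mobile" still need a word-overlap score.

```r
# Normalise phrases into order-insensitive keys:
# lower-case, split on whitespace, sort the words, glue them back together.
make_key <- function(phrases) {
  vapply(strsplit(tolower(trimws(phrases)), "\\s+"),
         function(words) paste(sort(words), collapse = " "),
         character(1))
}

t1 <- c("Ripe yellow Banana", "Phillips Trimmer", "Pepsi")
t2 <- c("Yellow Ripe Banana", "Trimmer Phillips", "Banana")

# match() returns the position in t2 of the first identical key, or NA if none
idx <- match(make_key(t1), make_key(t2))
result <- data.frame(Word = t1, Match = t2[idx])
print(result)
```

Because match() works on hashed values, this step stays fast even when the second table has a million rows.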

The final output should look like this.

Table 3:

Word                            PageView   Match
Phillips Trimmer                123        Trimmer Phillips
Buy Samsung Mobile              45         Samsung Mobile
Ripe yellow Banana              63         Yellow Ripe Banana
Pepsi                           140        NA

It would be much appreciated if stemming and tokenization could be applied before the matching.
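A minimal sketch of that preprocessing step, assuming whitespace and punctuation separate words. tokenize and stem_tokens are hypothetical helper names. Stemming is delegated to wordStem from the SnowballC package when that package is installed; if it is not, the tokens are returned unchanged, so the stemming part is optional here.

```r
# Split phrases into lower-case word tokens; any run of
# non-alphanumeric characters is treated as a separator.
tokenize <- function(phrases) {
  tokens <- strsplit(tolower(phrases), "[^[:alnum:]]+")
  lapply(tokens, function(w) w[nzchar(w)])  # drop empty strings
}

# Stem each token vector with SnowballC if available (assumption:
# the package is optional), otherwise return the tokens as they are.
stem_tokens <- function(tokens) {
  if (!requireNamespace("SnowballC", quietly = TRUE)) return(tokens)
  lapply(tokens, SnowballC::wordStem)
}

toks <- tokenize(c("Buy Samsung Mobile", " Ripe yellow Banana"))
print(toks)
print(stem_tokens(toks))
```

The token lists produced this way can then be fed into whichever matching step is used afterwards.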

I tried the following, but it does not work properly, and the loop takes a lot of time.

file_1$match <- ""
for (i in 1:dim(file_1)[1]) {
  print(i)
  x <- grep(file_1$Keywords[i], file_2$Keyword, value = TRUE,
            ignore.case = TRUE, useBytes = TRUE)
  x <- paste0(x, "")
  file_1$match[i] <- x
}

I also tried 'agrep' with different values of the 'max.distance' parameter. The results were not what I expected.
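agrep works on character-level edit distance, so swapping two words counts as many edits. A score based on the set of words is insensitive to order. The sketch below uses Jaccard similarity (shared words divided by all distinct words) as one possible measure; this is a swapped-in technique, not something from the original post, and token_set and jaccard are illustrative names.

```r
# Distinct lower-case words of a single phrase
token_set <- function(s) unique(strsplit(tolower(trimws(s)), "\\s+")[[1]])

# Jaccard similarity of two phrases: |intersection| / |union| of their word sets
jaccard <- function(a, b) {
  ta <- token_set(a)
  tb <- token_set(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

s1 <- jaccard("Ripe yellow Banana", "Yellow Ripe Banana")  # same words, different order
s2 <- jaccard("Buy Samsung Mobile", "Samsung Mobile")      # two of three words shared
s3 <- jaccard("Pepsi", "Banana")                           # nothing shared
print(c(s1, s2, s3))
```

A threshold on this score (for example, treating anything below some cut-off as NA) would have to be tuned on real data.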

Answer 1

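Edit: I used the "apply" function to do the following for each row of tab1. The function inside "apply" takes x[1] as the keyword (say "Ripe Yellow Banana"); strsplit splits it on spaces ("Ripe" "Yellow" "Banana"), and grepl is run on each of these pieces to see whether the pattern occurs in tab2. So you get 3 columns of TRUE/FALSE values, one each for "Ripe", "Yellow" and "Banana". The next step is to count the TRUEs in each row and output the tab2 entry with that row number. I also added an if statement to return NA when the maximum TRUE count is 0: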

tab1<-data.frame(Keyword=c("Phillips Trimmer",
                 "Buy Samsung Mobile","Ripe Yellow Banana","Pepsi"),
                 PageView=c(123,45,63,140))

tab2<-data.frame(Keyword=c("Electric Trimmer","Samsung Mobile",
                 "Yellow Ripe Banana","Samsung S6","Banana",
                  "Phillips","Trimmer Phillips","Buy Trimmer Philips"),
                 PageView=c(123,45,63,304,105,209,29,21))

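# Number of words in each tab2 keyword
tab2$StrLen <- apply(tab2, 1, function(x) length(unlist(strsplit(x[1], " "))))

tab1$BestMatch <- apply(tab1, 1, function(x) {
  words <- unlist(strsplit(x[1], " "))
  # One logical column per word: does the word occur in each tab2 keyword?
  a <- sapply(words, grepl, tab2$Keyword)
  a <- cbind(a, TRUECnt = rowSums(a == TRUE))
  a <- as.data.frame(a)
  a$StrLen <- length(words)

  # No word matched anywhere in tab2
  if (max(a$TRUECnt) == 0) {
    return(NA)
  }
  # Keep the tab2 keywords with the most matching words
  # that are no longer than the tab1 keyword
  return(as.character(tab2[which(a$TRUECnt == max(a$TRUECnt) &
                                 tab2$StrLen <= a$StrLen), ]$Keyword))
})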

View(tab1)
       #              Keyword PageView          BestMatch
       # 1   Phillips Trimmer      123   Trimmer Phillips
       # 2 Buy Samsung Mobile       45     Samsung Mobile
       # 3 Ripe Yellow Banana       63 Yellow Ripe Banana
       # 4              Pepsi      140               <NA>
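
The code above runs grepl over the whole of tab2 once for every word of every tab1 keyword, which may be slow with a million rows. A possible alternative is to build an inverted index (word -> rows of the candidate table containing it) once and look words up in it. This is only a sketch under some assumptions: best_match is a hypothetical helper, words are compared as whole lower-cased tokens rather than as substrings, and the length restriction from the code above is not reproduced.

```r
best_match <- function(queries, candidates) {
  split_words <- function(x) strsplit(tolower(trimws(x)), "\\s+")
  cand_words <- lapply(split_words(candidates), unique)
  # Inverted index: word -> indices of the candidates that contain it
  index <- split(rep(seq_along(cand_words), lengths(cand_words)),
                 unlist(cand_words))
  vapply(split_words(queries), function(words) {
    # All candidate indices hit by any query word (one entry per shared word)
    hits <- unlist(index[unique(words)], use.names = FALSE)
    if (length(hits) == 0) return(NA_character_)
    u <- unique(hits)
    counts <- tabulate(match(hits, u), nbins = length(u))
    # Candidate sharing the most words; ties go to the one seen first in hits
    candidates[u[which.max(counts)]]
  }, character(1))
}

queries <- c("Phillips Trimmer", "Buy Samsung Mobile",
             "Ripe Yellow Banana", "Pepsi")
candidates <- c("Electric Trimmer", "Samsung Mobile", "Yellow Ripe Banana",
                "Samsung S6", "Banana", "Phillips", "Trimmer Phillips",
                "Buy Trimmer Philips")
res <- best_match(queries, candidates)
print(data.frame(Keyword = queries, BestMatch = res))
```

With the index built once, each lookup only touches the rows that share at least one word with the query instead of scanning the full table.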

Matching sentences in R
