Question

I have a data frame of comments that looks like this (df1):

Comments
Apple laptops are really good for work,we should buy them
Apple Iphones are too costly,we can resort to some other brands
Google search is the best search engine 
Android phones are great these days
I lost my visa card today

I have another data frame of merchant names that looks like this (df2):

Merchant_Name
Google
Android
Geoni
Visa
Apple
MC
WallMart

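If a merchant name from df2 appears in a comment in df1, I want to add that merchant name as a second column in df1, in R. The match does not have to be exact; approximate matching is required. Also, df1 contains around 500K rows! My final output data frame might look like this: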

Comments                                                        Merchant
Apple laptops are really good for work,we should buy them       Apple
Apple Iphones are too costly,we can resort to some other brands Apple
Google search is the best search engine                         Google
Android phones are great these days                             Android
I lost my visa card today                                       Visa

How can I do this efficiently in R? Thanks.

Answer 1

This is a job for regex. Take a look at the grepl call inside lapply.

comments = c(
   'Apple laptops are really good for work,we should buy them',
   'Apple Iphones are too costly,we can resort to some other brands',
   'Google search is the best search engine ',
   'Android phones are great these days',
   'I lost my visa card today'
)

brands = c(
   'Google',
   'Android',
   'Geoni',
   'Visa',
   'Apple',
   'MC',
   'WallMart'
)

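# For each brand, collect the comments that mention it (case-insensitive
# regex match) and return them as a small data frame.
brandinpattern = lapply(
   brands,
   function(brand) {
      commentswithbrand = grepl(x = tolower(comments), pattern = tolower(brand))
      if ( any(commentswithbrand) ) {
         data.frame(
            comment = comments[commentswithbrand],
            brand = brand
         )
      } else {
         # No comment mentions this brand: contribute zero rows
         data.frame()
      }
   }
)

# Stack the per-brand data frames into a single result
brandinpattern = do.call(rbind, brandinpattern)


> brandinpattern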
                                                          comment   brand
1                        Google search is the best search engine   Google
2                             Android phones are great these days Android
3                                       I lost my visa card today    Visa
4       Apple laptops are really good for work,we should buy them   Apple
5 Apple Iphones are too costly,we can resort to some other brands   Apple
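
With around 500K comments, it may be cheaper to scan each comment once instead of once per brand. The following is only a sketch of that idea, not part of the original answer: it joins all brand names into one alternation pattern and uses base R's regexpr and regmatches. It assumes the brand names contain no regex metacharacters, and it keeps the brand mentioned earliest in each comment. The names pattern, hits, matched and result are made up for illustration.

```r
comments <- c(
  "Apple laptops are really good for work,we should buy them",
  "Apple Iphones are too costly,we can resort to some other brands",
  "Google search is the best search engine ",
  "Android phones are great these days",
  "I lost my visa card today"
)
brands <- c("Google", "Android", "Geoni", "Visa", "Apple", "MC", "WallMart")

# One alternation pattern such as "google|android|...".
# Assumption: brand names contain no regex metacharacters.
pattern <- paste(tolower(brands), collapse = "|")

# regexpr() reports the first match in each comment (-1 means no match)
lower_comments <- tolower(comments)
hits <- regexpr(pattern, lower_comments)

# regmatches() returns matched substrings only for comments that matched,
# so assign them back by position and leave the rest as NA
matched <- rep(NA_character_, length(comments))
matched[hits != -1] <- regmatches(lower_comments, hits)

# Map the lower-cased match back to the original brand spelling
result <- data.frame(
  Comments = comments,
  Merchant = brands[match(matched, tolower(brands))],
  stringsAsFactors = FALSE
)
```

Unlike the lapply version, this keeps every comment in its original order and leaves Merchant as NA when nothing matches. Short names such as MC can also match inside longer words, so wrapping the pattern in word boundaries (\\b) may be worth considering.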

Answer 2

Try this:

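# Start with an empty result; a row is appended for each matched comment
final_df <- data.frame(Comments = character(), Merchant_Name = character(), stringsAsFactors = FALSE)

for(i in df1$Comments){
    for(j in df2$Merchant_Name){
        # Case-insensitive test: does merchant j appear in comment i?
        if(grepl(tolower(j),tolower(i))){
            final_df[nrow(final_df) + 1,] <- c(i, j)
            break  # keep only the first matching merchant per comment
        }
    }
}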


final_df

##                                                         Comments Merchant_Name
##1       Apple laptops are really good for work,we should buy them         Apple
##2 Apple Iphones are too costly,we can resort to some other brands         Apple
##3                        Google search is the best search engine         Google
##4                             Android phones are great these days       Android
##5                                       I lost my visa card today          Visa
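
The question also asks for approximate matching, which plain grepl does not provide. One possible sketch, assuming base R's agrepl is acceptable: allow a small number of edits for longer merchant names and fall back to exact substring matching for short ones, since a one-edit tolerance on a two-letter name like MC would match almost anything. The helper fuzzy_first_match and its parameters max_edits and min_fuzzy_len are hypothetical names, not from the original answers.

```r
comments <- c(
  "Gogle search is the best search engine",   # misspelled on purpose
  "I lost my visa card today",
  "Nothing relevant here"
)
brands <- c("Google", "Visa", "MC")

# Hypothetical helper: returns the first brand found in a comment, or NA.
# Names with at least min_fuzzy_len characters are matched approximately
# (up to max_edits insertions/deletions/substitutions); shorter names
# must appear exactly, ignoring case.
fuzzy_first_match <- function(comment, brands, max_edits = 1L, min_fuzzy_len = 5L) {
  for (b in brands) {
    hit <- if (nchar(b) >= min_fuzzy_len) {
      agrepl(b, comment, max.distance = max_edits, ignore.case = TRUE)
    } else {
      grepl(tolower(b), tolower(comment), fixed = TRUE)
    }
    if (hit) return(b)
  }
  NA_character_
}

merchant <- vapply(comments, fuzzy_first_match, character(1),
                   brands = brands, USE.NAMES = FALSE)
result <- data.frame(Comments = comments, Merchant = merchant,
                     stringsAsFactors = FALSE)
```

Calling agrepl row by row is slow, so on 500K rows it would probably make sense to run an exact pass first and apply the fuzzy step only to the comments that are still unmatched.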

