Question

I have a master column that looks like this:

CompanyName
Google
Tesco

I also have another data frame that looks like this:

CompanyVariationsNames
google plc
tesco bank
tesco insurance
google finance
google play

I need the data to look like this:

Company Name  Variation1  Variation2       Variation3
Google        google plc  google finance   google play
Tesco         tesco bank  tesco insurance

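This is just a sample: I have 750 company names, which have returned about 5,000 company name variations. Because the CompanyVariationsNames column comes from a pool of over 100K company names, I first extracted all the matching customer variations into one column using the following code:

matched1 = subset(DF, grepl(paste(Set1, collapse = "|"), DF$Customer_Name, ignore.case = T ))

But I can't find a way to get them to look like the result shown above. Any suggestions would be much appreciated!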

Answer 1

One option could be

library(dplyr)
library(splitstackshape)

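# Tag each variation with the master name it contains, then spread the variations into columns
df2 %>%
  rowwise() %>%
  # CompanyName becomes the matching master name, or NA when nothing matches
  mutate(CompanyName = ifelse(is.null(as.character(Filter(length, 
                                           lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case=TRUE)])))),
                              NA,
                              as.character(Filter(length, 
                                                  lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case=TRUE)]))))) %>%
  # Drop variations that matched no master name
  filter(!is.na(CompanyName)) %>%
  group_by(CompanyName) %>%
  # Collapse each company's variations into one comma-separated string
  summarise(Variation = paste(CompanyVariationsNames, collapse=",")) %>%
  # Split that string into Variation_1, Variation_2, ... columns
  cSplit("Variation", ",")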

The output is:

   CompanyName Variation_1     Variation_2 Variation_3
1:      Google  google plc  google finance google play
2:       Tesco  tesco bank tesco insurance          NA

Sample data:

df1 <- structure(list(CompanyName = c("Google", "Tesco")), .Names = "CompanyName", class = "data.frame", row.names = c(NA, 
-2L))

df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank", 
"tesco insurance", "google finance", "google play")), .Names = "CompanyVariationsNames", class = "data.frame", row.names = c(NA, 
-5L))
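
For comparison, the same reshaping can be sketched in base R, without dplyr or splitstackshape. This is only an illustrative alternative, not the approach used in the answer: the names `matches`, `width`, `padded` and `wide` are made up for this sketch, and it assumes the master names are safe to use as regular expressions. Unlike the pipeline above, it keeps a master name that has no variations as a row of NA values, and a variation that contains several master names would be listed under each of them.

```r
# Sample data with the same values as df1 and df2 above
df1 <- data.frame(CompanyName = c("Google", "Tesco"), stringsAsFactors = FALSE)
df2 <- data.frame(
  CompanyVariationsNames = c("google plc", "tesco bank", "tesco insurance",
                             "google finance", "google play"),
  stringsAsFactors = FALSE
)

# For each master name, collect the variations that contain it (case-insensitive)
matches <- lapply(df1$CompanyName, function(nm) {
  df2$CompanyVariationsNames[grepl(nm, df2$CompanyVariationsNames,
                                   ignore.case = TRUE)]
})

# Pad every match vector with NA up to the longest one, then bind them as rows
width <- max(lengths(matches))
padded <- do.call(rbind, lapply(matches, function(v) {
  c(v, rep(NA_character_, width - length(v)))
}))
colnames(padded) <- paste0("Variation", seq_len(width))

# One row per master name, one column per variation
wide <- data.frame(CompanyName = df1$CompanyName, padded,
                   stringsAsFactors = FALSE)
print(wide)
```

Padding to a common width is what lets `rbind` build a rectangular result even though each company has a different number of variations.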

Update: added logic to handle the following error

Error in mutate_impl(.data, dots) : Column "CompanyName" must be length 1 (the group size), not 0
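
The error message says that the value computed for one row had length 0, which is what happens when a variation contains none of the master names. A minimal sketch of that guard in isolation is shown here; `match_company` and `master` are hypothetical names introduced for illustration and are not part of the answer's code.

```r
# Hypothetical master list, mirroring df1$CompanyName
master <- c("Google", "Tesco")

# Return the first master name contained in `variation`,
# or NA (length 1) when no master name matches
match_company <- function(variation, companies) {
  is_hit <- vapply(companies, grepl, logical(1),
                   x = variation, ignore.case = TRUE)
  hits <- companies[is_hit]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

print(match_company("google play", master))
print(match_company("amazon prime", master))
```

Returning `NA_character_` rather than an empty vector keeps every row at length 1, so the unmatched rows can be dropped afterwards with a filter on `is.na()`.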

Transposing partially matched data from rows to columns
