I have a master column that looks like this:
CompanyName
Google
Tesco
I also have another data frame that looks like this:
CompanyVariationsNames
google plc
tesco bank
tesco insurance
google finance
google play
I need the data to look like this:
Company Name   Variation1   Variation2        Variation3
Google         google plc   google finance    google play
Tesco          tesco bank   tesco insurance
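This is only a sample: I have 750 company names, which have returned around 5,000 company name variations.
I have managed to get all the matching customer variations into a single column with the following code, since the CompanyVariationsNames
column comes from a pool of more than 100,000 company names: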
matched1 <- subset(DF, grepl(paste(Set1, collapse = "|"), DF$Customer_Name, ignore.case = TRUE))
However, I cannot find a way to arrange them in the layout shown above.
Any suggestions would be greatly appreciated!
Answer 0 (score: 1)
One option could be
library(dplyr)
library(splitstackshape)

# For one variation, return the first master name it contains
# (case-insensitive), or NA when no master name matches.
first_match <- function(variation, masters) {
  hits <- masters[vapply(masters, grepl, logical(1), x = variation, ignore.case = TRUE)]
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

df2 %>%
  rowwise() %>%
  mutate(CompanyName = first_match(CompanyVariationsNames, df1$CompanyName)) %>%
  filter(!is.na(CompanyName)) %>%
  group_by(CompanyName) %>%
  # Collapse each company's variations into one string, then split it into columns.
  summarise(Variation = paste(CompanyVariationsNames, collapse = ",")) %>%
  cSplit("Variation", ",")
The output is:
   CompanyName Variation_1     Variation_2 Variation_3
1:      Google  google plc  google finance google play
2:       Tesco  tesco bank tesco insurance          NA
Sample data:
df1 <- structure(list(CompanyName = c("Google", "Tesco")),
                 .Names = "CompanyName", class = "data.frame",
                 row.names = c(NA, -2L))
df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank",
                                                 "tesco insurance", "google finance", "google play")),
                 .Names = "CompanyVariationsNames", class = "data.frame",
                 row.names = c(NA, -5L))
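For comparison, the same wide layout can probably be built in base R alone. The following is only a sketch under a few assumptions: names are matched as literal substrings after lower-casing (not as regular expressions), a variation containing several master names is listed under each of them, and companies without any match are kept as rows filled with NA. The object names `matches`, `padded` and `wide` are illustrative.

```r
# Sample data, redefined here so the sketch runs on its own.
df1 <- data.frame(CompanyName = c("Google", "Tesco"), stringsAsFactors = FALSE)
df2 <- data.frame(
  CompanyVariationsNames = c("google plc", "tesco bank", "tesco insurance",
                             "google finance", "google play"),
  stringsAsFactors = FALSE
)

# For each master name, collect the variations that contain it.
# Lower-casing both sides and using fixed = TRUE gives a literal,
# case-insensitive substring match (an assumption of this sketch).
matches <- lapply(df1$CompanyName, function(name) {
  hit <- grepl(tolower(name), tolower(df2$CompanyVariationsNames), fixed = TRUE)
  df2$CompanyVariationsNames[hit]
})

# Pad every match vector with NA up to the longest one, then bind row-wise.
width <- max(lengths(matches), 1L)
padded <- do.call(rbind, lapply(matches, function(v) v[seq_len(width)]))
colnames(padded) <- paste0("Variation", seq_len(width))

wide <- data.frame(CompanyName = df1$CompanyName, padded, stringsAsFactors = FALSE)
print(wide)
```

With 750 names and a pool of more than 100,000 variations this loops once per master name, which should be manageable, but it has not been timed on data of that size.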
Update: added logic to handle the following error:
Error in mutate_impl(.data, dots) : Column "CompanyName" must be length 1 (the group size), not 0
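This error most likely arises when a variation contains none of the master names: the lookup then yields a zero-length vector, which cannot fill a single row. The base R sketch below illustrates that condition and one possible guard. `lookup` and `safe_lookup` are hypothetical helpers, and "acme ltd" is an invented name that matches nothing.

```r
masters <- c("Google", "Tesco")

# Hypothetical helper: master names contained in one variation
# (case-insensitive regular-expression match, as in the answer).
lookup <- function(variation) {
  masters[vapply(masters, grepl, logical(1), x = variation, ignore.case = TRUE)]
}

length(lookup("tesco bank"))  # one master name matches
length(lookup("acme ltd"))    # nothing matches: zero-length result

# Guard: fall back to NA so that every row receives exactly one value.
safe_lookup <- function(variation) {
  hits <- lookup(variation)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

safe_lookup("acme ltd")
```

Rows that come back as NA can then be dropped with a filter before grouping.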