Transposing partially matched data from rows to columns

Time: 2018-04-06 06:20:47

Tags: r

I have a master column that looks like this:

CompanyName
Google
Tesco

I also have another data frame that looks like this:

CompanyVariationsNames
google plc
tesco bank
tesco insurance
google finance
google play

I need the data to look like this:

Company Name  Variation1  Variation2       Variation3
Google        google plc  google finance   google play
Tesco         tesco bank  tesco insurance

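This is only a sample: I have 750 company names, and about 5,000 company name variations have been returned for them. Since the CompanyVariationsNames column comes from a pool of more than 100,000 company names, I have managed to get all the matched customer variations into a single column using the following code: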

matched1 = subset(DF, grepl(paste(Set1, collapse = "|"), DF$Customer_Name, ignore.case = T ))

But I cannot find a way to make them look like the result shown above. Any suggestions would be greatly appreciated!

1 Answer:

Answer 0: (score: 1)

One option could be

library(dplyr)
library(splitstackshape)

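df2 %>%
  rowwise() %>%
  # For each variation, look up the master name it contains (case-insensitive).
  # The ifelse() wrapper leaves NA when no master name matches.
  mutate(CompanyName = ifelse(is.null(as.character(Filter(length, 
                                           lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case=TRUE)])))),
                              NA,
                              as.character(Filter(length, 
                                                  lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case=TRUE)]))))) %>%
  # Drop variations that matched no master name
  filter(!is.na(CompanyName)) %>%
  group_by(CompanyName) %>%
  # Collapse each company's variations into one comma-separated string ...
  summarise(Variation = paste(CompanyVariationsNames, collapse=",")) %>%
  # ... and split that string into one column per variation
  cSplit("Variation", ",")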

The output is:

   CompanyName Variation_1     Variation_2 Variation_3
1:      Google  google plc  google finance google play
2:       Tesco  tesco bank tesco insurance          NA

Sample data:

df1 <- structure(list(CompanyName = c("Google", "Tesco")), .Names = "CompanyName", class = "data.frame", row.names = c(NA, 
-2L))

df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank", 
"tesco insurance", "google finance", "google play")), .Names = "CompanyVariationsNames", class = "data.frame", row.names = c(NA, 
-5L))

Update: added logic to handle the following error:

Error in mutate_impl(.data, dots): Column `CompanyName` must be length 1 (the group size), not 0
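For comparison, here is a minimal base-R sketch of the same reshaping that needs no extra packages. It is not the answer's method, only an illustration under a few assumptions: master names are matched as literal, case-insensitive substrings; the extra row "acme ltd" and the helper names `matches`, `width`, `padded`, and `wide` are made up for the example. Since every master name keeps its own row, a name with no matches simply gets NA columns rather than a zero-length result.

```r
# Sample data mirroring the question; "acme ltd" is a hypothetical non-match.
df1 <- data.frame(CompanyName = c("Google", "Tesco"), stringsAsFactors = FALSE)
df2 <- data.frame(
  CompanyVariationsNames = c("google plc", "tesco bank", "tesco insurance",
                             "google finance", "google play", "acme ltd"),
  stringsAsFactors = FALSE
)

# For each master name, collect the variations that contain it.
# tolower() makes the match case-insensitive; fixed = TRUE treats the
# name as a literal string rather than a regular expression.
matches <- lapply(df1$CompanyName, function(nm) {
  hit <- grepl(tolower(nm), tolower(df2$CompanyVariationsNames), fixed = TRUE)
  df2$CompanyVariationsNames[hit]
})

# Pad every match vector with NA to a common width, then bind row-wise.
width <- max(lengths(matches), 1L)
padded <- do.call(rbind, lapply(matches, function(v) {
  c(v, rep(NA_character_, width - length(v)))
}))
colnames(padded) <- paste0("Variation_", seq_len(width))

# One row per master name, one column per variation.
wide <- data.frame(CompanyName = df1$CompanyName, padded,
                   stringsAsFactors = FALSE)
print(wide)
```

Note that in this sketch a variation containing two master names would be listed under both of them, and variations that match no master name are dropped.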