Merging two tables in R on an implicit column

Time: 2019-03-03 10:30:38

Tags: r dplyr lapply

I have two tables:

tab1=structure(list(generated_id = c(482160724447511, 482160724447511
), utc_time = structure(c(1L, 1L), .Label = "30.09.2018 12:46", class = "factor"), 
    local_time = structure(c(1L, 1L), .Label = "30.09.2018 15:46", class = "factor"), 
    user_locale = structure(c(1L, 1L), .Label = "en", class = "factor"), 
    network = structure(c(1L, 1L), .Label = "Facebook Installs", class = "factor"), 
    campaign = structure(c(1L, 1L), .Label = "(GR23)(BGM)(AND)(FB)(App Events)(US)(W35+)(27.09.2018) (23843105742120752)", class = "factor"), 
    adgroup = structure(c(1L, 1L), .Label = "(GR23)(BGM)(AND)(FB)(META)(US)(W35+)(NONE)(APP_EV)(NONE)(PURCHASE)(NONE)(27.09.2018) (23843105743590752)", class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

tab2=
structure(list(date = structure(c(1L, 1L), .Label = "10.10.2018", class = "factor"), 
    campaign_id = c(2.38431e+16, 2.38431e+16), ad_set_id = c(2.38431e+16, 
    2.38431e+16), spent = c(1.77, 13.85)), class = "data.frame", row.names = c(NA, 
-2L))

tab2$campaign_id=tab1$campaign
tab2$ad_set_id=tab1$adgroup

Usually I merge with a simple function call:

merge(tab1, tab2, by = c("campaign", "adgroup"))

But in this case I am stuck, because the ID in tab1$campaign sits inside parentheses at the end of the string:

(GR23)(BGM)(AND)(FB).... (***23843105743590752***)


(GR23)(BGM)(AND)(FB)(META)(US)(W35+)(NONE)(APP_EV)(NONE)(PURCHASE)(NONE)(27.09.2018) (***23843105743590752***)

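where the value marked with *** is the ID to merge on.

In this case, how can I merge tab1 and tab2 by campaign and adgroup when the merge key in tab1 is enclosed in parentheses at the end of the string?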

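1 answer:

Answer 0: (score: 1)

If I understand your question correctly, the problem is to merge the tables on a substring of a column. One way to do this is to extract that substring and add it to tab1 as a new column.

Because the rows in tab1 are identical and the ids in tab2 do not match any of those in tab1, I used a different data set: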

tab1 <- structure(list(campaign = c("(GR23)(BGM)(AND)(FB)(App Events)(US)(W35+)(27.09.2018) (23843105742120752)", 
                                    "(GR23)(BGM)(AND)(FB)(App Events)(US)(W35+)(27.09.2018) (23843105742120753)"), 
                       adgroup = c("(GR23)(BGM)(AND)(FB)(META)(US)(W35+)(NONE)(APP_EV)(NONE)(PURCHASE)(NONE)(27.09.2018) (23843105743590752)", 
                                   "(GR23)(BGM)(AND)(FB)(META)(US)(W35+)(NONE)(APP_EV)(NONE)(PURCHASE)(NONE)(27.09.2018) (23843105743590752)"), 
                       generated_id = c(482160724447511, 482160724447511)), 
                  row.names = c(NA, -2L), class = "data.frame")
tab2 <- structure(list(campaign_id = c("23843105742120752", "23843105742120753"), 
                       ad_set_id = c("23843105743590752", "23843105743590752"), 
                       date = c("10.10.2018", "10.10.2018"), spent = c(1.77, 13.85)), 
                  row.names = c(NA, -2L), class = "data.frame")


# Create a function that extracts the id from the last part
extract_id <- function(x){
  s <- strsplit(as.character(x), " ")
  s_id <- sapply(s, function(si) si[length(si)])
  ids <- gsub("[^[:digit:]]", "", s_id) # Remove everything except digits
  return(ids)
}

# Add the extracted id's to tab1
tab1$campaign_id <- extract_id(tab1$campaign)
tab1$adgroup_id <- extract_id(tab1$adgroup)

# Your result
result <- merge(tab1, tab2, 
                by.x = c("campaign_id", "adgroup_id"), 
                by.y = c("campaign_id", "ad_set_id"))
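
As an alternative to splitting on spaces, the same ID can be pulled out with a single regular expression. This is only a sketch: the name extract_id_regex is hypothetical, and it assumes every value ends with a run of digits in parentheses. If a value does not match, sub() returns it unchanged, so the result is worth checking before merging.

```r
# Hypothetical helper: capture the digits inside the last pair of
# parentheses at the end of the string (assumes that pattern is present).
extract_id_regex <- function(x) {
  sub(".*\\(([0-9]+)\\)[[:space:]]*$", "\\1", as.character(x))
}

campaign <- c(
  "(GR23)(BGM)(AND)(FB)(App Events)(US)(W35+)(27.09.2018) (23843105742120752)",
  "(GR23)(BGM)(AND)(FB)(App Events)(US)(W35+)(27.09.2018) (23843105742120753)"
)

ids <- extract_id_regex(campaign)
print(ids)
```

Because the helper calls as.character() first, it also accepts factor columns such as those in the question's original data.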

Note that, apart from the different values, some columns also have different types, namely character instead of factor.
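
The type difference matters if you start from the question's original data. The sketch below uses a small made-up data frame (df and its shortened campaign string are assumptions for illustration) to show a factor column being converted to character, and why IDs of this size are safer kept as character than as numeric: they are larger than 2^53, beyond which a double cannot represent every integer exactly.

```r
# Hypothetical one-row data frame with a factor column, similar in
# spirit to the dput() output in the question.
df <- data.frame(campaign = factor("(GR23)(BGM) (23843105742120752)"))

# Convert the factor to character before extracting or merging on it.
df$campaign <- as.character(df$campaign)
is_chr <- is.character(df$campaign)

# IDs this large exceed 2^53, the limit for exact integers in a double,
# so two different IDs stored as numeric are not guaranteed to stay distinct.
exceeds_exact_range <- 23843105742120752 > 2^53

print(is_chr)
print(exceeds_exact_range)
```

Keeping the key columns in both tables as character avoids relying on numeric comparison for the merge.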