我有以下与匹配不同数据框相关的问题。
首先,我有下一张桌子:
table<-data.frame(brand=c('duna','cars','cars','sea','sea','sea','mega','moon','moon'),model=c('mm','mm','mm','ll','ll','ll','tr','tr','tr'),version=c("2.8 sr cab. dupla 4x4 tdi","2.0 lsdakar 16v 4x4 hi-flex 5-p","2.4 ls cab. simples 4x2 flex 2-p","2.3 xl cab. simples 4x2 2-p","1.8 sx 5-p","1.0 mpfi joy 8v","hatch ls 1.0 8v","2.3 xlt cab. dupla 4x2 limited 4-p","1.4 fire ce xlt flex 2-p"))
brand model version
1 duna mm 2.8 sr cab. dupla 4x4 tdi
2 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
3 cars mm 2.4 ls cab. simples 4x2 flex 2-p
4 sea ll 2.3 xl cab. simples 4x2 2-p
5 sea ll 1.8 sx 5-p
6 sea ll 1.0 mpfi joy 8v
7 mega tr hatch ls 1.0 8v
8 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p
9 moon tr 1.4 fire ce xlt flex 2-p
我必须将它与下一个匹配:
table_match<-data.frame(brand=c('duna','cars','sea','mega','moon'),model=c('mm','mm','ll','tr','tr'),version=c('tdi','ls','xl','ls','xlt'))
table_match$id<-paste0(table_match$brand,table_match$model,table_match$version)
brand model version id
1 duna mm tdi dunammtdi
2 cars mm ls carsmmls
3 sea ll xl seallxl
4 mega tr ls megatrls
5 moon tr xlt moontrxlt
所以,这里的问题是将brand
,model
和version
与table_match
匹配到table
。
例如,在table
brand=duna
中,model=mm
和version
包含确切的单词"tdi"
,那么它就是匹配!因此id
(在table_match
中)与该匹配位于version
旁边。
brand model version match
1 duna mm 2.8 sr cab. dupla 4x4 tdi dunammtdi
2 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
3 cars mm 2.4 ls cab. simples 4x2 flex 2-p carsmmls
4 sea ll 2.3 xl cab. simples 4x2 2-p seallxl
5 sea ll 1.8 sx 5-p
6 sea ll 1.0 mpfi joy 8v
7 mega tr hatch ls 1.0 8v megatrls
8 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p moontrxlt
9 moon tr 1.4 fire ce xlt flex 2-p moontrxlt
答案 0 :(得分:3)
我认为我们可以只进行合并然后使用正则表达式对其进行过滤。试一试
dat = merge(table, table_match, by = c("brand", "model"))
dat$match = mapply(function(x, y) grepl(paste("\\b", x, "\\b", sep = ""), y), dat$version.y, dat$version.x)
dat$match = ifelse(dat$match, dat$id, "")
dat = dat[ , !colnames(dat) %in% c("version.y", "id")]
colnames(dat)[3] = "version"
dat = dat[with(dat, order(brand)), ]
brand model version match
1 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
2 cars mm 2.4 ls cab. simples 4x2 flex 2-p carsmmls
3 duna mm 2.8 sr cab. dupla 4x4 tdi dunammtdi
4 mega tr hatch ls 1.0 8v megatrls
5 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p moontrxlt
6 moon tr 1.4 fire ce xlt flex 2-p moontrxlt
7 sea ll 2.3 xl cab. simples 4x2 2-p seallxl
8 sea ll 1.8 sx 5-p
9 sea ll 1.0 mpfi joy 8v
答案 1 :(得分:2)
您还可以尝试使用regex_join()
包中的fuzzyjoin
。注:我在table_match
中的字符串'ls'周围添加了空格,这样正则表达式与table
中的字符串'lsdakar'不匹配,因为这不是原始海报想要的。
library(fuzzyjoin)
# Use data_frame() to get rid of stringsAsFactors problem
table <-
data_frame(
brand = c('duna', 'cars', 'cars', 'sea', 'sea', 'sea', 'mega', 'moon', 'moon'),
model = c('mm', 'mm', 'mm', 'll', 'll', 'll', 'tr', 'tr', 'tr'),
version = c(
"2.8 sr cab. dupla 4x4 tdi",
"2.0 lsdakar 16v 4x4 hi-flex 5-p",
"2.4 ls cab. simples 4x2 flex 2-p",
"2.3 xl cab. simples 4x2 2-p",
"1.8 sx 5-p",
"1.0 mpfi joy 8v",
"hatch ls 1.0 8v",
"2.3 xlt cab. dupla 4x2 limited 4-p",
"1.4 fire ce xlt flex 2-p"
)
)
# Use data_frame() here too
table_match <-
data_frame(
brand = c('duna', 'cars', 'sea', 'mega', 'moon'),
model = c('mm', 'mm', 'll', 'tr', 'tr'),
version = c('tdi', ' ls ', 'xl', ' ls ', 'xlt')
)
# Use regex_semi_join to find the matches
regex_semi_join(table, table_match, by = c(brand = "brand",
model = "model", version = "version"))
# A tibble: 6 × 3
brand model version
<chr> <chr> <chr>
1 duna mm 2.8 sr cab. dupla 4x4 tdi
2 cars mm 2.4 ls cab. simples 4x2 flex 2-p
3 sea ll 2.3 xl cab. simples 4x2 2-p
4 mega tr hatch ls 1.0 8v
5 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p
6 moon tr 1.4 fire ce xlt flex 2-p
>
# Use regex_anti_join to find the non-matches
regex_anti_join(table, table_match, by = c(brand = "brand",
model = "model", version = "version"))
# A tibble: 3 × 3
brand model version
<chr> <chr> <chr>
1 cars mm 2.0 lsdakar 16v 4x4 hi-flex 5-p
2 sea ll 1.8 sx 5-p
3 sea ll 1.0 mpfi joy 8v
>
答案 2 :(得分:1)
Fuzzy_join在两个相等的条件==
和第三个str_detect
来自stringr
。我不知道为什么模糊连接会产生三个重复,所以添加了unique()
注意:已添加stringsAsFactors = FALSE
以测试数据
table<-data.frame(brand=c('duna','cars','cars','sea','sea','sea','mega','moon','moon'),
model=c('mm','mm','mm','ll','ll','ll','tr','tr','tr'),
version=c("2.8 sr cab. dupla 4x4 tdi","2.0 lsdakar 16v 4x4 hi-flex 5-p","2.4 ls cab. simples 4x2 flex 2-p","2.3 xl cab. simples 4x2 2-p","1.8 sx 5-p","1.0 mpfi joy 8v","hatch ls 1.0 8v","2.3 xlt cab. dupla 4x2 limited 4-p","1.4 fire ce xlt flex 2-p")
stringsAsFactors = FALSE)
table_match<-data.frame(brand=c('duna','cars','sea','mega','moon'),
model=c('mm','mm','ll','tr','tr'),
version=c('tdi','ls','xl','ls','xlt'),
stringsAsFactors = FALSE)
library(fuzzyjoin)
library(stringr)
fuzzy_join(table, table_match,
by = c("brand", "model", "version"),
match_fun = c(`==`, `==`, function(x,y) { str_detect(x, paste0("\\b", y, "\\b" ))})) %>%
unique()
# brand.x model.x version.x brand.y model.y version.y
# 1 duna mm 2.8 sr cab. dupla 4x4 tdi duna mm tdi
# 4 cars mm 2.4 ls cab. simples 4x2 flex 2-p cars mm ls
# 7 sea ll 2.3 xl cab. simples 4x2 2-p sea ll xl
# 10 mega tr hatch ls 1.0 8v mega tr ls
# 13 moon tr 2.3 xlt cab. dupla 4x2 limited 4-p moon tr xlt
# 16 moon tr 1.4 fire ce xlt flex 2-p moon tr xlt