I have a data frame with roughly the following structure:
  C1                          C2    C3
1 c("XXX", "Y3")              "XXX" "Y31"
2 c("SFM", "DD31", "DSDW")    "SFF" "DD31"
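Column C1 is a list column: each entry was originally a single string that I split into individual words. The other two columns are character. I need to match C2 and C3 against C1 so that, wherever there is a match (a match is guaranteed to exist), the value in C1 is replaced with another value. For example:
The first row has 2 matches, because a fuzzy (partial) match also counts as a match: "XXX" equals the value in C2, and "Y3" is contained in the C3 value "Y31".
In general I understand how to do this: use a for loop, a matching function, and regular expressions, but I don't know enough to put it all together. Thanks in advance!
What I have: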
x <- structure(list(Description = list(c("2012", "Deere", "544K",
"Wheel", "Loader,"), c("Caterpillar","Model", "988", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
#> Description Manufacturer Model
#> 4 2012, Deere, 544K, Wheel, Loader, john deere 544k
#> 5 Caterpillar, Model, 988, Year, 1972 caterpillar 988
What I want:
x.new <- structure(list(Description = list(c("2012", "john deere[Manufacturer]", "544k[Model]",
"Wheel", "Loader,"), c("caterpillar[Manufacturer]","Model", "988[Model]", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
#> Description Manufacturer Model
#> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader, john deere 544k
#> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972 caterpillar 988
Answer 0 (score: 4)
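Working with list columns takes a lot of lapply and its multivariate equivalent Map, which let you iterate over list columns and return a list that can be assigned back as a column. For example,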
df <- structure(list(C1 = list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW")),
C2 = c("XXX", "SFF"),
C3 = c("Y31", "DD31")),
.Names = c("C1", "C2", "C3"), row.names = c(NA, -2L), class = "data.frame")
df$C1_new <- Map(function(c1, c2, c3) {
  sapply(c1, function(x) {
    mtch <- grepl(x, c(c2, c3))
    if (any(mtch)) {
      paste0(c(c2, c3)[mtch], '[', names(df)[-1][mtch], ']')
    } else {
      x
    }
  })
}, df$C1, df$C2, df$C3)
df
#> C1 C2 C3 C1_new
#> 1 XXX, Y3 XXX Y31 XXX[C2], Y31[C3]
#> 2 SFM, DD31, DSDW SFF DD31 SFM, DD31[C3], DSDW
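For readers less familiar with Map, here is a minimal sketch of the behaviour the call above relies on: the function is called once per row, with the matching elements of each argument paired up, and the result is always a list. The names `words`, `suffix`, and `res` are made up for this illustration.

```r
# Map() calls the function once per position, pairing up the elements of
# each argument, and always returns a list (one element per "row").
words  <- list(c("a", "b"), c("c"))   # stands in for a list column
suffix <- c("_x", "_y")               # stands in for a character column

res <- Map(function(w, s) paste0(w, s), words, suffix)
str(res)
```

Because the result is a list with one element per row, it can be assigned straight back into the data frame as a list column.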
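There are many other ways to set this up, including packages such as purrr and stringr, which make the syntax simpler and more consistent. Take your pick.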
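As one possible illustration of the package route, the sketch below redoes the first example with purrr::pmap, purrr::map_chr, and stringr::str_detect. It assumes both packages are installed. It also differs from the base version in two ways that are my own choices, not part of the original answer: words are matched literally via stringr::fixed rather than as regular expressions, and only the first matching column is used if a word matches both.

```r
library(purrr)
library(stringr)

df <- structure(list(C1 = list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW")),
                     C2 = c("XXX", "SFF"),
                     C3 = c("Y31", "DD31")),
                .Names = c("C1", "C2", "C3"), row.names = c(NA, -2L), class = "data.frame")

# pmap() walks the three columns in parallel, one row at a time
df$C1_new <- pmap(list(df$C1, df$C2, df$C3), function(c1, c2, c3) {
  targets <- c(C2 = c2, C3 = c3)
  # map_chr() guarantees exactly one string per word
  map_chr(c1, function(word) {
    hit <- str_detect(targets, fixed(word))   # literal substring test
    if (any(hit)) {
      # assumption: keep only the first matching column
      paste0(targets[hit][1], "[", names(targets)[hit][1], "]")
    } else {
      word
    }
  })
})
```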
To apply this to the second dataset in the question, only a few small edits are needed:
x <- structure(list(Description = list(c("2012", "Deere", "544K", "Wheel", "Loader,"),
c("Caterpillar","Model", "988", "Year", "1972")),
Manufacturer = c("john deere", "caterpillar"),
Model = c("544k", "988")),
.Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
x$Description <- Map(function(desc, mfr, mdl) {
  sapply(desc, function(wrd) {
    mtch <- grepl(wrd, c(mfr, mdl), ignore.case = TRUE)
    if (any(mtch)) {
      paste0(c(mfr, mdl)[mtch], '[', names(x)[-1][mtch], ']')
    } else {
      wrd
    }
  })
}, x$Description, x$Manufacturer, x$Model)
x
#> Description Manufacturer Model
#> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader, john deere 544k
#> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972 caterpillar 988
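If the same tagging is needed for other column sets, the pattern could be wrapped in a function. The sketch below is only one way to do that. `tag_matches` and its arguments are hypothetical names, the lookup columns are assumed to be character (not factor), and, unlike the code above, it matches words literally and keeps only the first matching column so that every word maps to exactly one string.

```r
# Hypothetical helper: tag each word of a list column with the first
# column (from `cols`) whose value contains it, ignoring case.
tag_matches <- function(data, list_col, cols) {
  tag_row <- function(words, ...) {
    targets <- c(...)   # this row's values, in the order of `cols`
    vapply(words, function(word) {
      # fixed = TRUE treats the word as literal text, not a regular expression
      hit <- grepl(tolower(word), tolower(targets), fixed = TRUE)
      if (any(hit)) {
        paste0(targets[hit][1], "[", cols[hit][1], "]")
      } else {
        word
      }
    }, character(1), USE.NAMES = FALSE)
  }
  # build the call Map(tag_row, <list column>, <col 1>, <col 2>, ...)
  args <- c(list(tag_row, data[[list_col]]), unname(as.list(data[cols])))
  do.call(Map, args)
}

x <- structure(list(Description = list(c("2012", "Deere", "544K", "Wheel", "Loader,"),
                                       c("Caterpillar", "Model", "988", "Year", "1972")),
                    Manufacturer = c("john deere", "caterpillar"),
                    Model = c("544k", "988")),
               .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")

tagged <- tag_matches(x, "Description", c("Manufacturer", "Model"))
```

Using vapply with character(1) instead of sapply makes the function fail loudly if a word ever produces something other than a single string, rather than silently returning a differently shaped result.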