Matching a list-type column against the other columns of a data frame

Asked: 2017-01-02 19:34:00

Tags: r regex match

I have a data frame with roughly the following structure:

         C1                   C2      C3
1  c("XXX", "Y3")            "XXX"   "Y31"
2  c("SFM", "DD31", "DSDW")  "SFF"   "DD31"

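Column C1 is a list column: each entry is a string that I split into individual words. The other two columns are character vectors. I need to match C2 and C3 against C1 so that, wherever there is a match (a match is guaranteed to exist), the value in C1 is replaced with another value. For example: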

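The first row has 2 matches, because a fuzzy (partial) match also counts as a match:

  1. C1~C2: replace "XXX" in C1 with the modified value of C2, "XXX [TAG]"
  2. C1~C3: replace "Y3" in C1 with the modified value of C3, "Y31 [TAG]"

In general I understand how this could be done: with a for loop, a matching function, and regular expressions. My knowledge is not enough to put it all together, though. Thanks in advance!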

EDITED

What I have:

    x <- structure(list(Description = list(c("2012", "Deere", "544K", 
                                        "Wheel", "Loader,"), c("Caterpillar","Model", "988", "Year", "1972")), 
                        Manufacturer = c("john deere", "caterpillar"), 
                        Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
    
    
    #>     Description                        Manufacturer Model
    #> 4   2012, Deere, 544K, Wheel, Loader,   john deere  544k
    #> 5 Caterpillar, Model, 988, Year, 1972  caterpillar   988
    

What I want:

    x.new <- structure(list(Description = list(c("2012", "john deere[Manufacturer]", "544k[Model]", 
                                             "Wheel", "Loader,"), c("caterpillar[Manufacturer]","Model", "988[Model]", "Year", "1972")), 
                        Manufacturer = c("john deere", "caterpillar"), 
                        Model = c("544k", "988")), .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")
    
    #>  Description                                                 Manufacturer Model
    #> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader,  john deere  544k
    #> 5 caterpillar[Manufacturer], Model, 988[Model], Year, 1972    caterpillar   988
    

1 answer:

Answer 0 (score: 4)

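Working with list columns takes a lot of lapply and its multivariate equivalent Map, which lets you iterate over several columns in parallel (list columns included) and returns a list that can be assigned back as a column. For example,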

    df <- structure(list(C1 = list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW")), 
                         C2 = c("XXX", "SFF"), 
                         C3 = c("Y31", "DD31")), 
                    .Names = c("C1", "C2", "C3"), row.names = c(NA, -2L), class = "data.frame")

    # Walk the three columns in parallel, one row at a time
    df$C1_new <- Map(function(c1, c2, c3){
        # For each word in C1, test whether it occurs (as a regex) in C2 or C3
        sapply(c1, function(x){
            mtch <- grepl(x, c(c2, c3))
            # On a match, substitute the matched value tagged with its column name
            if (any(mtch)) {paste0(c(c2, c3)[mtch], '[', names(df)[-1][mtch], ']')} else {x}
        })},
        df$C1, df$C2, df$C3)

    df
    #>                C1  C2   C3              C1_new
    #> 1         XXX, Y3 XXX  Y31    XXX[C2], Y31[C3]
    #> 2 SFM, DD31, DSDW SFF DD31 SFM, DD31[C3], DSDW
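One caveat with the approach above: grepl treats each word as a regular expression, so a word containing characters such as "." or "(" may match too much or raise an error. The sketch below is one possible way around that, not the only one. It assumes a hypothetical helper, tag_matches (my name, not part of base R), that matches literally with fixed = TRUE, keeps only the first hit so every word maps to exactly one string, and lowercases both sides when case should be ignored (grepl does not combine fixed = TRUE with ignore.case = TRUE).

```r
# Hypothetical helper (not from the original answer): tag each word with the
# first candidate value that contains it as a literal substring.
# `candidates` is a named character vector; the names are used as tags.
tag_matches <- function(words, candidates, ignore_case = FALSE) {
  haystack <- if (ignore_case) tolower(candidates) else candidates
  vapply(words, function(word) {
    needle <- if (ignore_case) tolower(word) else word
    # fixed = TRUE: treat the word as plain text, not as a regex
    hit <- which(grepl(needle, haystack, fixed = TRUE))
    if (length(hit) > 0) {
      paste0(candidates[hit[1]], "[", names(candidates)[hit[1]], "]")
    } else {
      word
    }
  }, character(1), USE.NAMES = FALSE)
}

df <- data.frame(C2 = c("XXX", "SFF"), C3 = c("Y31", "DD31"),
                 stringsAsFactors = FALSE)
df$C1 <- list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW"))

df$C1_new <- Map(function(c1, c2, c3) tag_matches(c1, c(C2 = c2, C3 = c3)),
                 df$C1, df$C2, df$C3)
```

Using vapply with character(1) instead of sapply means the result is always a plain character vector, even if a word happens to occur in more than one column.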

There are many other ways to set this up, including packages such as purrr or stringr, which make the syntax simpler and more uniform. Take your pick.
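As a rough illustration of the purrr/stringr route, here is one way it might look; treat it as a sketch rather than the canonical form. It assumes both packages are installed, uses pmap to walk the columns row by row, map_chr to force one string per word, and str_detect with fixed() for literal matching. Taking only the first hit is my own simplification.

```r
library(purrr)
library(stringr)

df <- data.frame(C2 = c("XXX", "SFF"), C3 = c("Y31", "DD31"),
                 stringsAsFactors = FALSE)
df$C1 <- list(c("XXX", "Y3"), c("SFM", "DD31", "DSDW"))

# pmap() iterates over the three columns in parallel, one row per call
df$C1_new <- pmap(list(df$C1, df$C2, df$C3), function(c1, c2, c3) {
  candidates <- c(C2 = c2, C3 = c3)
  # map_chr() requires exactly one string per word, so keep the first hit only
  map_chr(c1, function(word) {
    hit <- which(str_detect(candidates, fixed(word)))
    if (length(hit) > 0) {
      paste0(candidates[hit[1]], "[", names(candidates)[hit[1]], "]")
    } else {
      word
    }
  })
})
```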

To apply this to the second dataset in the question, only a few small edits are needed:

    x <- structure(list(Description = list(c("2012", "Deere", "544K", "Wheel", "Loader,"), 
                                           c("Caterpillar","Model", "988", "Year", "1972")), 
                        Manufacturer = c("john deere", "caterpillar"), 
                        Model = c("544k", "988")), 
                   .Names = c("Description", "Manufacturer", "Model"), row.names = 4:5, class = "data.frame")

    # Overwrite Description in place; names(x)[-1] supplies the tags
    x$Description <- Map(function(desc, mfr, mdl){
        sapply(desc, function(wrd){
            # ignore.case = TRUE so that "Deere" matches "john deere", "544K" matches "544k"
            mtch <- grepl(wrd, c(mfr, mdl), ignore.case = TRUE)
            if (any(mtch)) {paste0(c(mfr, mdl)[mtch], '[', names(x)[-1][mtch], ']')} else {wrd}
        })},
        x$Description, x$Manufacturer, x$Model)

    x
    #>                                                   Description Manufacturer Model
    #> 4 2012, john deere[Manufacturer], 544k[Model], Wheel, Loader,   john deere  544k
    #> 5    caterpillar[Manufacturer], Model, 988[Model], Year, 1972  caterpillar   988
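If you want to check the result against the desired output programmatically, note that sapply names each element after the original word, so a direct identical() comparison fails even though the printed values look the same. A small self-contained check, using plain vectors instead of the data frame (the variable names here are mine), might look like this:

```r
description  <- list(c("2012", "Deere", "544K", "Wheel", "Loader,"),
                     c("Caterpillar", "Model", "988", "Year", "1972"))
manufacturer <- c("john deere", "caterpillar")
model        <- c("544k", "988")
tags         <- c("Manufacturer", "Model")

# Same matching logic as above, applied to the plain vectors
tagged <- Map(function(desc, mfr, mdl) {
  sapply(desc, function(wrd) {
    mtch <- grepl(wrd, c(mfr, mdl), ignore.case = TRUE)
    if (any(mtch)) paste0(c(mfr, mdl)[mtch], "[", tags[mtch], "]") else wrd
  })
}, description, manufacturer, model)

# The Description column of x.new from the question
expected <- list(
  c("2012", "john deere[Manufacturer]", "544k[Model]", "Wheel", "Loader,"),
  c("caterpillar[Manufacturer]", "Model", "988[Model]", "Year", "1972"))

names(tagged[[1]])                           # the original words, added by sapply
identical(tagged, expected)                  # FALSE: the names differ
identical(lapply(tagged, unname), expected)  # TRUE once the names are dropped
```

Passing USE.NAMES = FALSE to sapply would avoid the names in the first place.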