R: matching values at scale (using apply?)

Time: 2014-03-26 16:34:27

Tags: r match mapply

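Is there a way to make matching values at scale more programmatic? Basically, I want to add a bunch of value-lookup columns to a data frame, but I don't want to write out the match() indexing expression every time. This seems like a use case for mapply, but I can't figure out how to use it here. Any suggestions?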

Here is the data:

data <- data.frame(
    region = sample(c("northeast","midwest","west"), 50, replace = TRUE),
    climate = sample(c("dry","cold","arid"), 50, replace = TRUE),
    industry = sample(c("tech","energy","manuf"), 50, replace = TRUE))

And the corresponding lookup table:

lookups <- data.frame(
    orig_val = c("northeast","midwest","west","dry","cold","arid","tech","energy","manuf"),
    look_val = c("dir1","dir2","dir3","temp1","temp2","temp3","job1","job2","job3")
    )    

So what I want to do is: first add a column to "data" called "reg_lookups" that matches each region to its corresponding value in "lookups", then do the same for "climate_lookups", and so on.

Right now I'm doing it the messy way:

data$reg_lookup <- lookups$look_val[match(data$region, lookups$orig_val)]
data$clim_lookup <- lookups$look_val[match(data$climate, lookups$orig_val)]
data$indus_lookup <- lookups$look_val[match(data$industry, lookups$orig_val)]

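I tried writing a function to do this, but the function doesn't seem to work, so applying it with mapply is a non-starter (plus I'm confused about how the mapply syntax would work here):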

match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
    df$newval <- df_look$lookup_val[match(df$var, df_look$ref_val)]
    return(df)
}

data2 <- match_fun(data, reg_2, lookups, look_val, region, orig_val)
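A note on why match_fun appears to do nothing: the $ operator does not evaluate its right-hand side, so df$var looks for a column literally named "var" rather than the column whose name was passed in. Every $ lookup in the body therefore yields NULL and the data frame comes back unchanged. Below is a minimal sketch of a working variant, assuming the column names are passed as character strings and indexed with [[ ]]. The names match_fun2, demo and demo_look are made up for illustration and are not part of the original question.

```r
# Hypothetical helper: add one lookup column using match().
# Column names are passed as strings and indexed with [[ ]],
# because `$` does not evaluate its right-hand side.
match_fun2 <- function(df, new_col, var, df_look,
                       ref_col = "orig_val", val_col = "look_val") {
  idx <- match(df[[var]], df_look[[ref_col]])   # row positions in the lookup table
  df[[new_col]] <- df_look[[val_col]][idx]      # pull the mapped values
  df
}

# Small stand-in data so the sketch runs on its own
demo <- data.frame(
  region  = c("west", "midwest", "northeast", "west"),
  climate = c("dry", "arid", "cold", "cold")
)
demo_look <- data.frame(
  orig_val = c("northeast", "midwest", "west", "dry", "cold", "arid"),
  look_val = c("dir1", "dir2", "dir3", "temp1", "temp2", "temp3")
)

demo2 <- match_fun2(demo,  "reg_lookup",  "region",  demo_look)
demo2 <- match_fun2(demo2, "clim_lookup", "climate", demo_look)
```

Because match() returns positions in the order of its first argument, the rows of demo2 stay in the same order as the rows of demo.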

1 answer:

Answer 0 (score: 0)

I think you just want to do this:

data <- merge(data,lookups[1:3,],by.x = "region",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[4:6,],by.x = "climate",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[7:9,],by.x = "industry",by.y = "orig_val",all.x = TRUE)

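But it is much better to store the lookups in separate data frames. That way you can control the names of the new columns more easily. It also lets you do something like this: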

lookups1 <- split(lookups,rep(1:3,each = 3))
colnames(lookups1[[1]]) <- c('region','reg_lookup')
colnames(lookups1[[2]]) <- c('climate','clim_lookup')
colnames(lookups1[[3]]) <- c('industry','indus_lookup')

do.call(cbind,mapply(merge,
        x = list(data[,1,drop = FALSE],data[,2,drop = FALSE],data[,3,drop = FALSE]),
        y = lookups1,
        MoreArgs = list(all.x = TRUE),
        SIMPLIFY = FALSE))

You should also be able to wrap the do.call bit in a function.

I used data[,1,drop = FALSE] to keep each piece as a one-column data frame.
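As a quick illustration of what drop = FALSE changes, here is a small sketch on a toy data frame (the object names are made up for the example):

```r
toy <- data.frame(a = 1:3, b = c("x", "y", "z"))

v <- toy[, 1]                 # default drop = TRUE: collapses to a plain vector
d <- toy[, 1, drop = FALSE]   # stays a data frame with a single column "a"

is.data.frame(v)   # FALSE
is.data.frame(d)   # TRUE
```

merge needs the column name to find its key, which is why the one-column data frame form is the one to pass along.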

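The mapply call is built by passing the named arguments (the x = and y = parts) as lists. I wanted to make sure all rows in data are kept, so I pass all.x = TRUE through MoreArgs (note the capital M; mapply does not recognize moreArgs), so that it is supplied on every call to merge. Finally, I need to stitch the pieces together myself, so I turned off SIMPLIFY.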
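To make the argument handling concrete, here is a minimal sketch of the same mapply pattern. It swaps merge for match(), which is a substitution and not the method used above: merge() sorts its result by the key column by default, so pieces merged separately and then bound with cbind are not guaranteed to line up with the original rows, whereas match() keeps the input order. The objects demo and demo_look and the new column names are assumptions made for the example.

```r
# Small stand-in data so the sketch runs on its own
demo <- data.frame(
  region  = c("west", "midwest", "northeast", "west"),
  climate = c("dry", "arid", "cold", "cold")
)
demo_look <- data.frame(
  orig_val = c("northeast", "midwest", "west", "dry", "cold", "arid"),
  look_val = c("dir1", "dir2", "dir3", "temp1", "temp2", "temp3")
)

new_cols <- mapply(
  function(x, look) look$look_val[match(x, look$orig_val)],
  x = demo[c("region", "climate")],    # a data frame is a list of columns: one call per column
  MoreArgs = list(look = demo_look),   # passed unchanged to every call
  SIMPLIFY = FALSE                     # keep a list instead of collapsing to a matrix
)
names(new_cols) <- c("reg_lookup", "clim_lookup")

# Stitch the new columns onto the original data; row order is preserved
result <- cbind(demo, as.data.frame(new_cols))
```

Arguments supplied in the x = position are walked element by element, while everything in MoreArgs is handed to each call as is, which is the same division of labor as in the merge version.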