Recoding a large number of variables using another data frame in R

Time: 2017-10-03 13:35:38

Tags: r

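I want to use one data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame (Df3) that contains text such as local/international instead of 1s/2s. Df1 contains missing values, and I want to make sure they are represented as NA.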

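This is a minimal working example. The actual data set contains more than one hundred variables (all of class character), each with 1 to 15 levels. Any help would be much appreciated.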

Starting point (dfs)

Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))

Desired outcome (df)

Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))

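My thinking so far, not real code: if a row of Df2's NameOfVariable matches a variable name in Df1, and that row's VariableLevel matches the observation in Df1, then write the corresponding VariableDef from Df2 into Df1. I wonder whether this could be done with if statements.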

if (Df2["NameOfVariable"]==names(Df1))
{
  if (Df2["VariableLevel"]==Df1[ ])
  {
   Df1[ ] <- paste0("VariableDef") 
  }
}

3 answers:

Answer 0: (score: 1)

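Here is a method in base R using match and Map. Map applies a function to corresponding list elements. Here there are two lists: Df1, whose elements are its columns, and a list built from the second and third columns of Df2, split by the first column. The second list is reordered to match the order of the names in Df1.

The applied function matches the values of a Df1 column against the first column (VariableLevel) of the corresponding element of the second list, and uses the resulting positions as an index to return the corresponding labels (VariableDef) from Df2. Map returns a list, which is converted to a data.frame with the function of the same name.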

data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
               Df1,
               split(Df2[2:3], Df2[1])[names(Df1)]))

This returns

  buyer_Q1 seller_Q2 price_Q1_2
1    local  internat    50-100K
2 internat     local   100-200K
3    local        NA      200+K
4    local  internat   100-200K
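
To make the Map/match step easier to follow, here is a minimal sketch that unpacks it for a single column. The names lookups, idx and price_labels are only illustrative and are not part of the answer above, and stringsAsFactors = FALSE is an assumption made here so that the labels stay character.

```r
# Example data from the question
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
                  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
                  VariableDef = c("local", "internat", "local", "internat", "NA",
                                  "50-100K", "100-200K", "200+K"),
                  stringsAsFactors = FALSE)

# One lookup table per variable, reordered to follow the columns of Df1
lookups <- split(Df2[2:3], Df2[1])[names(Df1)]

# For a single column: positions of the codes within that variable's lookup table
idx <- match(Df1$price_Q1_2, lookups$price_Q1_2$VariableLevel)

# Those positions index the labels
price_labels <- lookups$price_Q1_2$VariableDef[idx]
price_labels
```

Map simply repeats these last two steps for every column of Df1, pairing each column with the lookup table of the same name.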

Answer 1: (score: 0)

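A solution using a loop and factors. Be careful: the result looks the same as Df3, but it is not identical. The function fun returns a data frame whose columns are factors. You can convert them to character if needed.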

Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))

fun <- function(df, mdf) {
  for (varn in names(df)) {
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  return(df)
}

fun(Df1, Df2)
Df3
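
If character columns are needed rather than factors, one possible way to convert the result is sketched below. The name res is only illustrative, and stringsAsFactors = FALSE is an assumption added here to keep the example independent of the R version.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
                  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
                  VariableDef = c("local", "internat", "local", "internat", "NA",
                                  "50-100K", "100-200K", "200+K"),
                  stringsAsFactors = FALSE)

# Same function as above: recode each column as a factor with labels taken from mdf
fun <- function(df, mdf) {
  for (varn in names(df)) {
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef), ]
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  return(df)
}

res <- fun(Df1, Df2)

# Convert every factor column to character while keeping the data frame structure
res[] <- lapply(res, as.character)
str(res)
```

Assigning into res[] rather than res keeps the object a data frame with its original column names.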

Answer 2: (score: 0)

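A solution using dplyr and tidyr. The code works even though it produces warning messages; the warnings appear because the columns are factors. If you do not want to see any warnings, set stringsAsFactors = FALSE when creating the data frames, as in the example data I provide below.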

library(dplyr)
library(tidyr)

Df3 <- Df1 %>%
  mutate(ID = 1:n()) %>%
  gather(NameOfVariable, VariableLevel, -ID) %>%
  left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
  select(-VariableLevel) %>%
  spread(NameOfVariable, VariableDef) %>%
  select(-ID)

Df3
  buyer_Q1 price_Q1_2 seller_Q2
1    local    50-100K  internat
2 internat   100-200K     local
3    local      200+K        NA
4    local   100-200K  internat
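
Note that in the example data the label for seller_Q2 level 3 is the string "NA", not a missing value, so the NA printed above is text. If a real NA is wanted, one option is to fix the lookup table before recoding. This is a minimal base R sketch; the names seller and seller_labels are only illustrative.

```r
Df2 <- data.frame(NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
                  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
                  VariableDef = c("local", "internat", "local", "internat", "NA",
                                  "50-100K", "100-200K", "200+K"),
                  stringsAsFactors = FALSE)

# Turn the literal string "NA" into a real missing value in the lookup table
Df2$VariableDef[Df2$VariableDef == "NA"] <- NA

# Check on one variable: code 3 now maps to a real NA
seller <- Df2[Df2$NameOfVariable == "seller_Q2", ]
seller_labels <- seller$VariableDef[match(c(2, 1, 3, 2), seller$VariableLevel)]
is.na(seller_labels)
# [1] FALSE FALSE  TRUE FALSE
```

With Df2 corrected this way, the join above should carry the real NA through to the result, which would then print as <NA>.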

Data

Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
                  "seller_Q2"=c(2,1,3,2),
                  "price_Q1_2"=c(2,5,7,5),
                  stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
                  "VariableLevel"=c(1,2,1,2,3,2,5,7),
                  "VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
                  stringsAsFactors = FALSE)