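I want to use one data frame (Df2) to recode the variables of another data frame (Df1), so that the end result is a data frame (Df3) containing text such as local/international instead of 1s/2s. There are missing values in Df1, and I want to make sure they are represented as NA.
This is a minimal working example; the actual dataset contains over a hundred variables (all of character class) with 1 to 15 levels each. Any help would be much appreciated.
Starting point (dfs)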
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Desired result (df)
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
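My thinking so far, not real code: (if a row of Df2 NameOfVariable matches a Df1 variable name, and Df2 VariableLevel matches the Df1 observation, then paste the corresponding Df2 VariableDef value into Df1. I wonder whether an if statement could be used.)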
if (Df2["NameOfVariable"]==names(Df1))
{
if (Df2["VariableLevel"]==Df1[ ])
{
Df1[ ] <- paste0("VariableDef")
}
}
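Answer 0 (score: 1)
Here is an approach in base R using match and Map. Map applies a function to corresponding list elements. Here there are two lists: Df1 and a list made up of the second and third columns of Df2, split by column 1. The second list is reordered to match the order of the names in Df1.
The applied function matches the elements of a Df1 column against the first column (VariableLevel) of the corresponding piece of Df2 and uses the result as an index to return the corresponding labels (VariableDef). Map returns a list, which is converted to a data.frame with the function of the same name.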
data.frame(Map(function(x, y) y[[2]][match(x, y[[1]])],
Df1,
split(Df2[2:3], Df2[1])[names(Df1)]))
which returns
  buyer_Q1 seller_Q2 price_Q1_2
1    local  internat    50-100K
2 internat     local   100-200K
3    local        NA      200+K
4    local  internat   100-200K
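To see what each piece of the one-liner contributes, it can be unpacked for a single column. This is only a sketch on the question's Df1 and Df2; the names `lookup`, `tab`, and `idx` are introduced here for illustration and are not part of the original answer.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

# One lookup table (VariableLevel, VariableDef) per variable name
lookup <- split(Df2[2:3], Df2$NameOfVariable)

# split() orders its pieces by factor level (alphabetical here),
# so index by names(Df1) to line the tables up with the columns of Df1
lookup <- lookup[names(Df1)]

# For one column: positions of the codes in the table, then the labels
tab <- lookup[["price_Q1_2"]]
idx <- match(Df1$price_Q1_2, tab$VariableLevel)
idx                   # 1 2 3 2
tab$VariableDef[idx]  # "50-100K" "100-200K" "200+K" "100-200K"
```

A code that has no row in Df2 gets NA from match, so its label comes out as NA as well.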
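Answer 1 (score: 0)
A solution using a loop and factors. Be careful: the result looks equivalent to Df3, but it is not. The function fun returns a data frame whose columns are factors. If needed, you can convert them to character.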
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),"seller_Q2"=c(2,1,3,2),"price_Q1_2"=c(2,5,7,5))
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),"VariableLevel"=c(1,2,1,2,3,2,5,7),"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"))
Df3 <- data.frame("buyer_Q1"=c("local","internat","local","local"),"seller_Q2"=c("internat","local","NA","internat"),"price_Q1_2"=c("50-100K","100-200K","200+K","100-200K"))
fun <- function(df, mdf) {
  for (varn in names(df)) {
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef),]
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  return(df)
}
fun(Df1, Df2)
Df3
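The conversion to character mentioned above could look like the sketch below. It also turns the literal string "NA" stored in Df2 into a real missing value, which is an assumption about what the question means by "represented as NA"; the name `res` is introduced only for illustration.

```r
Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5))
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

# Same function as in the answer
fun <- function(df, mdf) {
  for (varn in names(df)) {
    dat <- mdf[mdf$NameOfVariable == varn & !is.na(mdf$VariableDef), ]
    df[[varn]] <- factor(df[[varn]], dat$VariableLevel, dat$VariableDef)
  }
  df
}

res <- fun(Df1, Df2)

# Convert each factor column to character and replace the literal
# string "NA" with a real missing value; res[] keeps the data.frame shape
res[] <- lapply(res, function(col) {
  col <- as.character(col)
  col[col %in% "NA"] <- NA
  col
})

str(res)
```

Using `%in%` rather than `==` keeps the logical index free of NA values even when a column already contains real missing values.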
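Answer 2 (score: 0)
A solution using dplyr and tidyr. The code works even if it emits warning messages, which appear because the columns are factors. If you do not want to see any warnings, set stringsAsFactors = FALSE when creating the data frames, as in the example data I provide.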
library(dplyr)
library(tidyr)
Df3 <- Df1 %>%
  mutate(ID = 1:n()) %>%
  gather(NameOfVariable, VariableLevel, -ID) %>%
  left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
  select(-VariableLevel) %>%
  spread(NameOfVariable, VariableDef) %>%
  select(-ID)
Df3
  buyer_Q1 price_Q1_2 seller_Q2
1    local    50-100K  internat
2 internat   100-200K     local
3    local      200+K        NA
4    local   100-200K  internat
Data
Df1 <- data.frame("buyer_Q1"=c(1,2,1,1),
"seller_Q2"=c(2,1,3,2),
"price_Q1_2"=c(2,5,7,5),
stringsAsFactors = FALSE)
Df2 <- data.frame("NameOfVariable"=c("buyer_Q1","buyer_Q1","seller_Q2","seller_Q2","seller_Q2","price_Q1_2","price_Q1_2","price_Q1_2"),
"VariableLevel"=c(1,2,1,2,3,2,5,7),
"VariableDef"=c("local","internat","local","internat","NA","50-100K","100-200K","200+K"),
stringsAsFactors = FALSE)
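In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Assuming such a version is installed, the same pipeline could be sketched with the newer verbs as below; this is a variation for illustration, not part of the original answer.

```r
library(dplyr)
library(tidyr)

Df1 <- data.frame(buyer_Q1 = c(1, 2, 1, 1),
                  seller_Q2 = c(2, 1, 3, 2),
                  price_Q1_2 = c(2, 5, 7, 5),
                  stringsAsFactors = FALSE)
Df2 <- data.frame(
  NameOfVariable = c("buyer_Q1", "buyer_Q1", "seller_Q2", "seller_Q2",
                     "seller_Q2", "price_Q1_2", "price_Q1_2", "price_Q1_2"),
  VariableLevel = c(1, 2, 1, 2, 3, 2, 5, 7),
  VariableDef = c("local", "internat", "local", "internat", "NA",
                  "50-100K", "100-200K", "200+K"),
  stringsAsFactors = FALSE)

Df3 <- Df1 %>%
  mutate(ID = row_number()) %>%                 # row key for reshaping back
  pivot_longer(-ID, names_to = "NameOfVariable",
               values_to = "VariableLevel") %>% # one row per (ID, variable)
  left_join(Df2, by = c("NameOfVariable", "VariableLevel")) %>%
  select(-VariableLevel) %>%
  pivot_wider(names_from = NameOfVariable,
              values_from = VariableDef) %>%    # back to one row per ID
  select(-ID) %>%
  as.data.frame()

Df3
```

The recoded values are the same as in the gather()/spread() version; only the reshaping verbs differ.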