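Replacing strings with a lookup table in dplyr

Time: 2015-07-08 14:45:37

Tags: r string dplyr

I am trying to create a lookup table in R so that my data ends up in the same format the company I work for uses.

It concerns different education categories that I want to merge using dplyr.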

    library(dplyr)

    # Create data
    education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")

    data <- data.frame(X1=replicate(1,sample(education,1000,rep=TRUE)))

    tbl_df(data)

    # Create lookup table
    lut <- c("Mechanichal Engineering" = "Engineering",
             "Electric Engineering" = "Engineering",
             "Political Science" = "Social Science",
             "Economics" = "Social Science")

    # Assign lookup table
    data$X1 <- lut[data$X1]

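But in my output the old values are replaced with the wrong values, i.e. not the ones I defined in the lookup table. Instead, the lookup values seem to be assigned at random.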
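One plausible cause, assuming an R version before 4.0 (where data.frame() converts strings to factors by default): indexing a named vector with a factor uses the factor's integer codes rather than its labels. The sketch below reproduces the effect with a hand-built factor; `x_factor`, `wrong` and `right` are illustrative names that do not appear in the question.

```r
education <- c("Mechanichal Engineering", "Electric Engineering",
               "Political Science", "Economics")

lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering"    = "Engineering",
         "Political Science"       = "Social Science",
         "Economics"               = "Social Science")

# Assumption: X1 is stored as a factor, so its levels are in sorted order,
# which differs from the order of the names in lut.
x_factor <- factor(c("Economics", "Political Science"),
                   levels = sort(education))

# Indexing by factor uses the integer codes (positions in the level set),
# so the lookup is positional and returns mismatched values.
wrong <- unname(lut[x_factor])

# Converting to character first makes the lookup go by name.
right <- unname(lut[as.character(x_factor)])

print(wrong)
print(right)
```

If this is what happens in the question, replacing the last line there with `data$X1 <- unname(lut[as.character(data$X1)])` should give the intended categories.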

3 Answers:

Answer 0: (score: 2)

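    library(reshape2)  # provides melt()

    education <- c("Mechanichal Engineering","Electric Engineering","Political Science","Economics")
    lut <- list("Mechanichal Engineering" = "Engineering",
                "Electric Engineering" = "Engineering",
                "Political Science" = "Social Science",
                "Economics" = "Social Science")
    # melt() turns the named list into a data frame with columns `value` and `L1` (the list names)
    lut2 <- melt(lut)
    data1 <- data.frame(X1 = sample(education, 1000, replace = TRUE))
    # match() compares by label, so it works whether X1 is a factor or a character vector
    data1$new <- lut2[match(data1$X1, lut2$L1), "value"]
    head(data1)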


=======================  ==============
X1                       new           
=======================  ==============
Political Science        Social Science
Political Science        Social Science
Mechanichal Engineering  Engineering   
Mechanichal Engineering  Engineering   
Political Science        Social Science
Political Science        Social Science
=======================  ==============
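The same match-by-label lookup can also be sketched as a dplyr join, which may fit better given the question's tags. This is an alternative to the melt()/match() code above, not part of the answer; `lut_df` and `joined` are hypothetical names, and the small `data1` here is hand-written instead of sampled so the result is predictable.

```r
library(dplyr)

# Lookup table as a two-column data frame (hypothetical name: lut_df)
lut_df <- data.frame(
  X1  = c("Mechanichal Engineering", "Electric Engineering",
          "Political Science", "Economics"),
  new = c("Engineering", "Engineering", "Social Science", "Social Science"),
  stringsAsFactors = FALSE
)

# Small hand-written sample instead of 1000 random draws
data1 <- data.frame(
  X1 = c("Political Science", "Mechanichal Engineering", "Economics"),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of data1, in order, and adds the matching `new` value
joined <- left_join(data1, lut_df, by = "X1")
print(joined)
```

Keeping both columns as character avoids any factor-level mismatch during the join.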

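Answer 1: (score: 2)

I have been trying to solve this myself. I was not very happy with most of the solutions I found, so this is what I ended up with. I added an "Other" category to show that it also works when a value is not defined in the lookup table.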
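A minimal sketch of that idea, which is not necessarily the code this answer used: a named-vector lookup inside mutate(), where values missing from the lookup table (here "Other") are kept unchanged. Both the fallback behaviour and the column name `X2` are assumptions made for illustration.

```r
library(dplyr)

lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering"    = "Engineering",
         "Political Science"       = "Social Science",
         "Economics"               = "Social Science")

# "Other" is deliberately absent from the lookup table
data <- data.frame(X1 = c("Economics", "Electric Engineering", "Other"),
                   stringsAsFactors = FALSE)

data <- data %>%
  mutate(
    X1 = as.character(X1),                # look up by name, never by factor code
    X2 = coalesce(unname(lut[X1]), X1)    # assumption: unmatched values stay as-is
  )

print(data)
```

Replacing the second argument of coalesce() with a constant such as "Unknown" would instead collapse all unmatched values into one category.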


Answer 2: (score: 0)

The best approach I have found is to use recode() from the car package.

    # Observe that dplyr also has a recode function, so require car after dplyr
    require(dplyr)
    require(car)

The data are sampled from four education categories.

    education <- c("Mechanichal Engineering",
                   "Electric Engineering","Political Science","Economics")

    data <- data.frame(ID = 1:1000, X1 = sample(education, 1000, replace = TRUE))

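Using recode() on the data, I recode the categories.

    # car::recode is called explicitly so dplyr's recode() cannot be picked up by mistake
    lut <- data.frame(ID = 1:1000, X2 = car::recode(data$X1, '"Economics" = "Social Science";
                             "Electric Engineering" = "Engineering";
                             "Political Science" = "Social Science";
                             "Mechanichal Engineering" = "Engineering"'))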

To check that it worked correctly, join the original data and the recoded data.

    data <- full_join(data, lut, by = "ID")

    head(data)

   ID                     X1             X2
1  1       Political Science Social Science
2  2               Economics Social Science
3  3    Electric Engineering    Engineering
4  4       Political Science Social Science
5  5               Economics Social Science
6  6 Mechanichal Engineering    Engineering

With recode, you do not need to sort the data before recoding.
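dplyr's own recode() can take a named vector as the lookup table when it is spliced in with `!!!`, which may avoid the extra dependency on car. The sketch below is an alternative, not what this answer used; `x` and `recoded` are illustrative names, and recent dplyr releases mark recode() as superseded, so treat it as one option among several.

```r
library(dplyr)

lut <- c("Mechanichal Engineering" = "Engineering",
         "Electric Engineering"    = "Engineering",
         "Political Science"       = "Social Science",
         "Economics"               = "Social Science")

# Small hand-written character vector instead of the sampled data
x <- c("Political Science", "Economics", "Electric Engineering",
       "Mechanichal Engineering")

# `!!!lut` splices the named vector into old = new arguments;
# dplyr::recode is qualified in case car is also attached
recoded <- dplyr::recode(x, !!!lut)
print(recoded)
```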