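How do I recode multiple values in a vector into a single value?

Time: 2019-07-12 04:54:24

Tags: r replace recode

I have a problem: because my data come from different sources, my data.frame contains inconsistent attribute values. For example, entries in the States column refer to the same state but are written in different ways. Note that my actual data do not use US states.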

    df <- data.frame(Names=c("Adam", "Mark", "Dahlia", "Jeff", "Derek", 
                             "Arnold", "Sheppard", "Dwayne", "Nichols", "Shane"), 
                     Age=c(27, 28, 29, 37, 26, 22, 29, 34, 31, 30), 
                     States=c("AL", "Alaska", "Alabama", "WI", 
                              "Wisconsin", "AZ", "Arizona", "AL", "WI", "AK"))

I am trying to recode values such as AL, WI, AZ, and AK to Alabama, Wisconsin, Arizona, and Alaska, respectively.

So far I have come up with:

    case_when(

        df$States == "AL" ~ "Alabama",
        df$States == "AK" ~ "Alaska",
        df$States == "WI" ~ "Wisconsin",
        df$States == "AZ" ~ "Arizona",
    )

It gives me the output:

     [1] "Alabama"   NA          NA          "Wisconsin" NA          "Arizona"   NA
     [8] "Alabama"   "Wisconsin" "Alaska"

I don't want the NA values, so what I have to do is:

    case_when(

      df$States == "AL" ~ "Alabama",
      df$States == "Alabama" ~ "Alabama",
      df$States == "AK" ~ "Alaska",
      df$States == "Alaska" ~ "Alaska",
      df$States == "WI" ~ "Wisconsin",
      df$States == "Wisconsin" ~ "Wisconsin",
      df$States == "AZ" ~ "Arizona",
      df$States == "Arizona" ~ "Arizona",

    )

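It gives me the output I want, but I think there must be a simpler way to do this.

I am considering a loop, because later I want to convert this into pseudocode. However, I have no idea how to go about it. I would really appreciate any help.

Thanks.

2 Answers:

Answer 0: (score: 1)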

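You can use dplyr's recode function with a named vector. I used setNames to create a named character vector (similar to key/value pairs), but you can build the vector from any data. Using your example, we can set up some keys and values:

    keys <- state.abb # the abbreviations you want to replace
    vals <- state.name # the replacement values
    keysvals <- setNames(vals, keys) # create named vector

Now call recode. Make sure to use !!! to unquote and splice:

    library(dplyr)

    df$States <- recode(df$States, !!!keysvals)

Which returns: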

          Names Age    States
    1      Adam  27   Alabama
    2      Mark  28    Alaska
    3    Dahlia  29   Alabama
    4      Jeff  37 Wisconsin
    5     Derek  26 Wisconsin
    6    Arnold  22   Arizona
    7  Sheppard  29   Arizona
    8    Dwayne  34   Alabama
    9   Nichols  31 Wisconsin
    10    Shane  30    Alaska
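Since the question notes that the real data are not US states, the built-in state.abb and state.name vectors may not apply. The following is a minimal sketch of the same named-vector idea with a hand-built mapping. It swaps recode for plain base R indexing by name, so dplyr is not required. The names `states`, `lookup`, and `recoded` are hypothetical and used only for illustration.

```r
# Example column from the question, kept as a character vector.
states <- c("AL", "Alaska", "Alabama", "WI", "Wisconsin",
            "AZ", "Arizona", "AL", "WI", "AK")

# Hand-built key/value pairs: names are the codes to replace,
# values are the replacements (hypothetical name `lookup`).
lookup <- c(AL = "Alabama", AK = "Alaska", WI = "Wisconsin", AZ = "Arizona")

# Look up each value by name; values without a key come back as NA.
recoded <- unname(lookup[states])

# Keep the original value wherever no key matched.
recoded[is.na(recoded)] <- states[is.na(recoded)]

recoded
```

Only the abbreviations need to be listed in the mapping, because values that are already spelled out pass through unchanged.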

Answer 1: (score: 0)

If you intend to match against US state names, we can use the built-in vectors state.abb and state.name to match and replace.

    inds <- match(df$States, state.abb)
    df$States[which(!is.na(inds))] <- state.name[na.omit(inds)]

    df
    #       Names Age   States
    #1      Adam  27   Alabama
    #2      Mark  28    Alaska
    #3    Dahlia  29   Alabama
    #4      Jeff  37 Wisconsin
    #5     Derek  26 Wisconsin
    #6    Arnold  22   Arizona
    #7  Sheppard  29   Arizona
    #8    Dwayne  34   Alabama
    #9   Nichols  31 Wisconsin
    #10    Shane  30    Alaska
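To make the match step easier to follow, here is a small sketch that runs the same logic on a plain character vector and keeps the intermediate index visible. The vector `states` is a hypothetical four-element excerpt of the example data.

```r
# A short excerpt of the example column (hypothetical name `states`).
states <- c("AL", "Alaska", "Alabama", "WI")

# Position of each value in state.abb; NA where the value is not
# a known abbreviation (here, the two full names).
inds <- match(states, state.abb)
is_abbr <- !is.na(inds)

# Overwrite only the matched positions with the corresponding full names.
states[is_abbr] <- state.name[inds[is_abbr]]

states
```

Values that are not abbreviations are left alone, which is why no NA appears in the result.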

The case_when can also be shortened by using %in%, which compares against several values at once, instead of ==, which compares against a single value:

    library(dplyr)

    df %>%
      mutate(States = case_when(States %in% c("AL", "Alabama") ~ "Alabama",
                                States %in% c("AK", "Alaska") ~ "Alaska",
                                States %in% c("WI", "Wisconsin") ~ "Wisconsin",
                                States %in% c("AZ", "Arizona") ~ "Arizona",
                                TRUE ~ NA_character_))
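The question also mentions wanting a loop that can later be turned into pseudocode. The %in% grouping can be sketched as a base R loop over a list of groups. This is only one possible shape, and the names `states`, `groups`, and `recoded` are hypothetical.

```r
# Example column from the question, kept as a character vector.
states <- c("AL", "Alaska", "Alabama", "WI", "Wisconsin",
            "AZ", "Arizona", "AL", "WI", "AK")

# Each target name maps to every spelling that should collapse into it
# (hypothetical name `groups`).
groups <- list(
  Alabama   = c("AL", "Alabama"),
  Alaska    = c("AK", "Alaska"),
  Wisconsin = c("WI", "Wisconsin"),
  Arizona   = c("AZ", "Arizona")
)

# Start with all NA, mirroring the TRUE ~ NA_character_ fallback.
recoded <- rep(NA_character_, length(states))

# For each target name, overwrite the positions whose value is in its group.
for (target in names(groups)) {
  recoded[states %in% groups[[target]]] <- target
}

recoded
```

Adding another spelling or another target only requires a new entry in the list, while the loop itself stays the same.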
