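I have a problem: because the data comes from different sources, my data.frame contains inconsistent values for the same attribute. For example, the States column refers to the same states, but they are represented in different ways. Note that my actual data does not use US states.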
df <- data.frame(Names = c("Adam", "Mark", "Dahlia", "Jeff", "Derek",
                           "Arnold", "Sheppard", "Dwayne", "Nichols", "Shane"),
                 Age = c(27, 28, 29, 37, 26, 22, 29, 34, 31, 30),
                 States = c("AL", "Alaska", "Alabama", "WI",
                            "Wisconsin", "AZ", "Arizona", "AL", "WI", "AK"))
I am trying to recode values such as AL, WI, AZ, and AK to Alabama, Wisconsin, Arizona, and Alaska, respectively.
So far, I have come up with:
library(dplyr)  # case_when() comes from dplyr
case_when(
  df$States == "AL" ~ "Alabama",
  df$States == "AK" ~ "Alaska",
  df$States == "WI" ~ "Wisconsin",
  df$States == "AZ" ~ "Arizona"
)
which gives me this output:
[1] "Alabama" NA NA "Wisconsin" NA "Arizona" NA
[8] "Alabama" "Wisconsin" "Alaska"
I don't want the NA values, so what I have to do is:
case_when(
  df$States == "AL" ~ "Alabama",
  df$States == "Alabama" ~ "Alabama",
  df$States == "AK" ~ "Alaska",
  df$States == "Alaska" ~ "Alaska",
  df$States == "WI" ~ "Wisconsin",
  df$States == "Wisconsin" ~ "Wisconsin",
  df$States == "AZ" ~ "Arizona",
  df$States == "Arizona" ~ "Arizona"
)
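This gives me the output I want, but I think there must be a simpler way to do it.
I am considering a loop, because later I want to convert this into pseudocode. However, I have no idea how to go about it. I would really appreciate any help.
Thanks.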
Answer 0: (score: 1)
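You can use dplyr's recode function with a named vector. I use setNames to create a named character vector (similar to key/value pairs), but you can build the vector from any data. Using your example, we can set up some keys and values: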
keys <- state.abb # the abbreviations you want to replace
vals <- state.name # the replacement values
keysvals <- setNames(vals, keys) # create named vector
Now call recode. Make sure to use !!! to unquote and splice the named vector:
library(dplyr)
df$States <- recode(df$States, !!!keysvals)
After this, df looks like:
Names Age States
1 Adam 27 Alabama
2 Mark 28 Alaska
3 Dahlia 29 Alabama
4 Jeff 37 Wisconsin
5 Derek 26 Wisconsin
6 Arnold 22 Arizona
7 Sheppard 29 Arizona
8 Dwayne 34 Alabama
9 Nichols 31 Wisconsin
10 Shane 30 Alaska
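Since the real data is not US states, the same key/value idea can be sketched in base R with a hand-written lookup vector. This is only an illustration under assumptions: the names `lookup`, `states`, and `recoded` are hypothetical, and the four codes are taken from the example in the question.

```r
# Hypothetical lookup table: names are the codes to replace,
# values are the full labels to use instead.
lookup <- c(AL = "Alabama", AK = "Alaska", WI = "Wisconsin", AZ = "Arizona")

states <- c("AL", "Alaska", "Alabama", "WI", "Wisconsin",
            "AZ", "Arizona", "AL", "WI", "AK")

# Index the lookup by name; values with no matching name come back as NA.
recoded <- unname(lookup[states])

# Keep the original value wherever no replacement was found.
recoded[is.na(recoded)] <- states[is.na(recoded)]

recoded
```

Values that are already spelled out ("Alaska", "Alabama", and so on) are left untouched, so only the codes listed in `lookup` need to be maintained.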
Answer 1: (score: 0)
If you intend to match against US state names, we can use the built-in vectors state.abb and state.name to match and replace.
inds <- match(df$States, state.abb)
df$States[which(!is.na(inds))] <- state.name[na.omit(inds)]
df
# Names Age States
#1 Adam 27 Alabama
#2 Mark 28 Alaska
#3 Dahlia 29 Alabama
#4 Jeff 37 Wisconsin
#5 Derek 26 Wisconsin
#6 Arnold 22 Arizona
#7 Sheppard 29 Arizona
#8 Dwayne 34 Alabama
#9 Nichols 31 Wisconsin
#10 Shane 30 Alaska
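To see why this works, it may help to look at what match returns on a small input. The sketch below uses a hypothetical four-element vector `states` rather than the full data frame:

```r
states <- c("AL", "Alaska", "Alabama", "WI")

# match() gives the position of each value in state.abb,
# or NA when the value is not an abbreviation.
inds <- match(states, state.abb)
inds
# [1]  1 NA NA 49

# Replace only the positions that matched an abbreviation.
found <- !is.na(inds)
states[found] <- state.name[inds[found]]
states
# [1] "Alabama"   "Alaska"    "Alabama"   "Wisconsin"
```

Entries that were already full names produce NA in `inds` and are therefore skipped by the replacement.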
You can also shorten the case_when call by using %in%, which can compare against multiple values, instead of ==, which compares against a single value.
library(dplyr)
df %>%
  mutate(States = case_when(States %in% c("AL", "Alabama") ~ "Alabama",
                            States %in% c("AK", "Alaska") ~ "Alaska",
                            States %in% c("WI", "Wisconsin") ~ "Wisconsin",
                            States %in% c("AZ", "Arizona") ~ "Arizona",
                            TRUE ~ NA_character_))
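The difference between the two operators can be checked on a small vector. This is a minimal base R illustration; the vector `x` is made up for the example:

```r
x <- c("AL", "Alabama", "AK")

# == tests each element against one value.
eq <- x == "AL"
eq
# [1]  TRUE FALSE FALSE

# %in% tests each element for membership in a set of values.
isin <- x %in% c("AL", "Alabama")
isin
# [1]  TRUE  TRUE FALSE
```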