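I have a school project, and I have spent more than three hours trying to solve this problem. The first variable of my dataset ("df") is "AREA". I have successfully filtered it so that the only values are the names of U.S. states.
I want to create a new column/variable called "Region". It takes the state listed in "AREA" and returns one of the four U.S. Census region names. Apparently R already has something for this (state.region?), but I couldn't get it to work, and I would rather code it out the long way.
Here is what I have after cleaning the data and installing the "dplyr", "tidyr", and "stringr" libraries: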
#Create U.S. Census regions
regionconvert<-function(x)
{
  if(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"))
  {return("South")}
  if(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"))
  {return("Northeast")}
  if(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"))
  {return("Midwest")}
  if(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"))
  {return("West")}
}
dfRegion=mutate(df,"Region"=regionconvert(df$AREA))
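I get the following warning, and every row of my new dataset has "South":
Warning message: In if (x %in% c("Texas", "Oklahoma", "Arkansas", "Louisiana", "Mississippi", : the condition has length > 1 and only the first element will be used
Any help you can give me would be greatly appreciated.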
Answer 0 (score: 3)
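First off, please don't use df$ inside mutate. One of the appeals (and main points) of most dplyr verb functions is that they work without being told the dataset object every time. So your call should look like this (though it still needs work):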
mutate(df, Region = regionconvert(AREA))
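But one step further: if/when you use grouping in a pipeline, the bare variable (as I've shown here) holds the data for the current group, not for the entire dataset. For example, if we want to rank the cars' mpg, but within each cylinder group: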
mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mpg))
# # A tibble: 32 x 12
# # Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb rnk
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 5.5
# 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 5.5
# 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 3.5
# 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 7
# 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 13
# 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2
# 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 4
# 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 5
# 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 3.5
# 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 3
# # ... with 22 more rows
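then rank is called three times: first with 11 values (cyl == 4), second with 7 values (cyl == 6), and third with 14 values (cyl == 8). If instead we had tried to call:
mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mtcars$mpg))
then each of the calls to rank would have received all 32 values. (This would fail, because mutate requires each function call to return either 1 value or as many values as there are rows in the current group.)
But if you were doing something like
mtcars %>% group_by(cyl) %>% summarize(avg = mean(mpg))
mtcars %>% group_by(cyl) %>% summarize(avg = mean(mtcars$mpg))
then the first would give the average for each cyl, and the second would report the same global average for all three groups.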
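For anyone who wants to check the per-group behaviour without dplyr, here is a minimal base-R sketch of the same idea. It is not part of the answer's approach; ave() and tapply() are swapped in only to show that a function applied per group sees just that group's values.

```r
# mtcars ships with R, so no packages are needed for this sketch.

# ave() calls rank() once per cyl group, on that group's mpg values only.
rnk_grouped <- ave(mtcars$mpg, mtcars$cyl, FUN = rank)

# Per-group means versus the single global mean.
avg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
avg_global <- mean(mtcars$mpg)

table(mtcars$cyl)    # group sizes for cyl 4, 6 and 8
head(rnk_grouped)    # ranks within each cylinder group
avg_by_cyl           # three different averages
avg_global           # one value, identical for every group
```

The first few values of rnk_grouped should line up with the rnk column printed above, since both rank mpg within each cyl group.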
Okay, now to your problem:
One problem is that your function expects x to be a single value (a scalar; technically, in R, a vector of length 1). Unfortunately, when it is called by mutate, it is passed a vector of values. There are several ways to deal with this, from least preferred to most preferred:
First, the quickest way to vectorize it is to use ifelse to return the specific region for each value. I suggest using dplyr::if_else here, though, because it makes certain type guarantees that base::ifelse does not.
regionconvert2 <- function(x) {
  # Nested if_else() calls: each level tests one region; anything unmatched becomes NA
  if_else(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
          "South",
  if_else(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
          "Northeast",
  if_else(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
          "Midwest",
  if_else(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"),
          "West",
          NA_character_))))
}
Second, pre-populate the output with all NA values, then replace them as the individual values are determined:
regionconvert3 <- function(x) {
  # x[NA] gives a vector of the same length and type as x, filled with NA
  out <- x[NA]
  ind <- x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware")
  out[ind] <- "South"
  ind <- x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania")
  out[ind] <- "Northeast"
  ind <- x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas")
  out[ind] <- "Midwest"
  ind <- x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
  out[ind] <- "West"
  return(out)
}
Frankly, I don't like this much because it is hard-coded (and has repeated code), so an improved version follows:
regionlist <- list(
  South = c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
  Northeast = c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
  Midwest = c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
  West = c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
)
regionconvert4 <- function(x, lookup) {
  out <- x[NA]
  # For each region name, mark the elements of x that belong to it
  for (nm in names(lookup)) {
    ind <- x %in% lookup[[nm]]
    out[ind] <- nm
  }
  return(out)
}
The purpose of this second version is to replace each value with the name of the list entry (a vector of possible values) that contains it.
Third, slightly the reverse of the previous technique is to provide a lookup of sorts. I'll modify the regionlist above so that the names are the states rather than the regions. (This can easily be created by other means.)
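# regiondf is the state/region frame created in the merge/join step below
statelist <- setNames(names(tibble::deframe(regiondf)),
                      tibble::deframe(regiondf))
statelist[1:5]
#       Texas    Oklahoma    Arkansas   Louisiana Mississippi
#     "South"     "South"     "South"     "South"     "South"
statelist[ c("Colorado","New Jersey") ]
#    Colorado  New Jersey
#      "West" "Northeast"
This removes the need for a function, a la statelist[AREA].
Fourth, merge/join. This is slightly more advanced, but in the long run I think it is more maintainable (for example, you can keep the state/region list in a simple CSV or spreadsheet, which may make editing, changing, or extending it easier). I'll create the new frame from the regionlist object, but it could just as easily be created directly or by more familiar means: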
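# Naming the column to unnest avoids the warning that newer tidyr versions give for a bare unnest()
regiondf <- tibble::enframe(regionlist, name="region", value="AREA") %>% tidyr::unnest(cols = AREA)
regiondf
# # A tibble: 50 x 2
#    region AREA
#    <chr>  <chr>
#  1 South  Texas
#  2 South  Oklahoma
#  3 South  Arkansas
#  4 South  Louisiana
#  5 South  Mississippi
#  6 South  Alabama
#  7 South  Georgia
#  8 South  Florida
#  9 South  Tennessee
# 10 South  Kentucky
# # ... with 40 more rows
Now I'll demo all of these with some simple sample data. (Side note: if things don't work for you, it may be because we don't have your sample data and/or any nuances that only you know about. In the future, please provide some sample data for testing along with your expected output.)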
# tibble() replaces the deprecated data_frame()
sampledata <- tibble::tibble(AREA = c("Colorado", "California", "New Jersey", "Florida", "Guam"))
sampledata %>%
  mutate(
    r2 = regionconvert2(AREA),
    r3 = regionconvert3(AREA),
    r4 = regionconvert4(AREA, regionlist),
    r5 = statelist[AREA]
  ) %>%
  left_join(regiondf, by = "AREA")
# # A tibble: 5 x 6
#   AREA       r2        r3        r4        r5        region
#   <chr>      <chr>     <chr>     <chr>     <chr>     <chr>
# 1 Colorado   West      West      West      West      West
# 2 California West      West      West      West      West
# 3 New Jersey Northeast Northeast Northeast Northeast Northeast
# 4 Florida    South     South     South     South     South
# 5 Guam       <NA>      <NA>      <NA>      <NA>      <NA>
(If you want to go with the fourth "merge/join" technique, you don't need the mutate.)
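To make the root cause concrete, here is a small base-R sketch of why a scalar if() test misbehaves on a whole column, followed by a named-vector lookup in the spirit of the third technique. The toy vectors area, south, and lookup are assumptions made up for this illustration and deliberately cover only a few states.

```r
# Toy input; the short state lists below are illustrative, not complete.
area  <- c("Alabama", "Colorado", "New Jersey", "Guam")
south <- c("Alabama", "Florida")

# %in% is vectorised: it returns one TRUE/FALSE per element of area.
cond <- area %in% south
cond
length(cond)   # one entry per row, but if() expects exactly one value

# Using only the first element mimics the symptom in the question:
# a single string is returned and then recycled over every row.
first_only <- if (cond[1]) "South" else NA_character_

# Vectorised alternative: a named character vector used as a lookup table.
lookup <- c(Alabama = "South", Florida = "South",
            Colorado = "West", "New Jersey" = "Northeast")
region <- unname(lookup[area])   # names not in the table give NA
region

# Listing the unmatched names is a cheap way to catch typos in the table.
area[is.na(region)]
```

In recent R versions (4.2.0 and later) a condition of length greater than one in if() is an error rather than a warning, so the original function would stop instead of silently filling in "South".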
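Answer 1 (score: 0)
state.region is a factor vector, not a function. It has 50 elements, ordered alphabetically by state name. To combine it with the dataset in the original post, you can convert it, together with state.name, into a tibble as follows.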
library(tidyverse)
stateNames <- tibble(state = as.character(state.name),region = as.character(state.region))
head(stateNames)
...and the first few rows of the output:
> head(stateNames)
# A tibble: 6 x 2
state region
<chr> <chr>
1 Alabama South
2 Alaska West
3 Arizona West
4 Arkansas South
5 California West
6 Colorado West
>
Now the state information can be merged with the AREA variable, as described in r2evans's answer.
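If a join feels like too much machinery, the same built-in vectors can also be used directly with match(). This is only a sketch: the area vector below is made-up sample data, not the asker's dataset.

```r
# Sample values are illustrative; "Guam" is included to show an unmatched name.
area <- c("Colorado", "California", "New Jersey", "Florida", "Guam")

# match() gives each value's position in state.name (NA if it is absent);
# indexing state.region by that position returns the matching region.
region <- as.character(state.region[match(area, state.name)])

data.frame(AREA = area, Region = region)
```

Note that the built-in factor may label the Midwest differently from the Census name used in the question (for example as "North Central"), so it is worth checking levels(state.region) before comparing labels.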