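Creating a new column in R using a function and mutate

Time: 2018-10-13 21:21:50

Tags: r function mutate

I have a school project, and I have spent more than three hours trying to figure this out. The first variable of my dataset ("df") is "AREA". I have already filtered it successfully, so the only values left are the names of U.S. states.

I want to create a new column/variable called "Region". It takes the state listed in "AREA" and returns one of the four U.S. Census region names. Apparently there is already something built into R for this (state.region?), but I could not get it to work, and I would rather code it out the long way.

Here is what I have, after cleaning the data and loading the "dplyr", "tidyr", and "stringr" libraries: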

#Create U.S. Census regions
regionconvert<-function(x)
{
  if(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"))
    {return("South")}
  if(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"))
    {return("Northeast")}
  if(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"))
    {return("Midwest")}
  if(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"))
    {return("West")}
}
dfRegion=mutate(df,"Region"=regionconvert(df$AREA))

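I get the following warning, and every row of my new dataset has "South":

Warning message: In if (x %in% c("Texas", "Oklahoma", "Arkansas", "Louisiana", "Mississippi", : the condition has length > 1 and only the first element will be used

Any help you can give me would be greatly appreciated.

2 Answers:

Answer 0: (score: 3)

First, please do not use df$ inside your mutate. One of the attractions (and main points) of most dplyr verbs is that they do not need to be told the dataset object over and over in order to work. So your call should look like this (though it still needs work):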

mutate(df, Region = regionconvert(AREA))

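But it goes further than that: if/when you use grouping in a pipeline, the bare variable (as I show it here) holds only the data for the current group, not the whole dataset. For example, suppose we want to rank the cars' mpg, but within each cylinder group: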

mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mpg))
# # A tibble: 32 x 12
# # Groups:   cyl [3]
#      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   rnk
#    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4   5.5
#  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4   5.5
#  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1   3.5
#  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1   7  
#  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2  13  
#  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1   2  
#  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4   4  
#  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2   5  
#  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2   3.5
# 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4   3  
# # ... with 22 more rows

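rank is then called three times: first with 11 values (cyl == 4), second with 7 values (cyl == 6), and third with 14 values (cyl == 8). If we had instead tried to call:

mtcars %>% group_by(cyl) %>% mutate(rnk = rank(mtcars$mpg))

then each call to rank would receive all 32 values. (This fails, because mutate needs each function call to return either 1 value or the same number of values as the group it was given.)

But if you were doing something like

mtcars %>% group_by(cyl) %>% summarize(avg = mean(mpg))
mtcars %>% group_by(cyl) %>% summarize(avg = mean(mtcars$mpg))

then the first gives the average for each cyl, while the second reports the global average, identical for all three groups.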
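The difference between the bare column and the mtcars$ form can be checked directly. The following is a minimal sketch, assuming dplyr is installed; the output column names (n_group, n_all, avg_group, avg_all) are made up for illustration.

```r
library(dplyr)

# Compare what each expression "sees" inside a grouped summarize().
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(
    n_group   = length(mpg),         # bare column: only the current group's rows
    n_all     = length(mtcars$mpg),  # mtcars$mpg: always the whole column
    avg_group = mean(mpg),           # per-group mean
    avg_all   = mean(mtcars$mpg)     # global mean, repeated for every group
  )

by_cyl$n_group  # 11 7 14
by_cyl$n_all    # 32 32 32
```

The group sizes match the 11, 7, and 14 values mentioned above, while the mtcars$ form always reports all 32 rows.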


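OK, now for your problem:

One problem is that your function expects x to be a single value (a scalar; technically, in R, a vector of length 1). Unfortunately, when it is called by mutate, it is passed a whole vector of values. There are several ways to deal with this, ordered from my least preferred to the most common: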

  1. The quickest way to vectorize is to use ifelse to return the specific region for each value. I recommend dplyr::if_else here, though, because it enforces some type guarantees that base::ifelse does not.

    regionconvert2 <- function(x) {
      if_else(x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"), "South",
        if_else(x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"), "Northeast",
          if_else(x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"), "Midwest",
            if_else(x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana"), "West", NA_character_))))
    }

  2. Pre-populate an all-NA output, then replace entries as each value is determined:

    regionconvert3 <- function(x) {
      out <- x[NA]
      ind <- x %in% c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware")
      out[ind] <- "South"
      ind <- x %in% c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania")
      out[ind] <- "Northeast"
      ind <- x %in% c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas")
      out[ind] <- "Midwest"
      ind <- x %in% c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
      out[ind] <- "West"
      return(out)
    }

    Frankly, I don't like this much because everything is hard-coded (and the code repeats itself), so an improved version follows:

    regionlist <- list(
      South = c("Texas","Oklahoma","Arkansas","Louisiana","Mississippi","Alabama","Georgia","Florida","Tennessee","Kentucky","West Virginia","Virginia","North Carolina","South Carolina", "Maryland","Delaware"),
      Northeast = c("Maine","New Hampshire","Vermont","Massachusetts","Connecticut","Rhode Island","New York","New Jersey","Pennsylvania"),
      Midwest = c("Ohio","Michigan","Illinois","Indiana","Wisconsin","Minnesota","Iowa","Missouri","North Dakota","South Dakota","Nebraska","Kansas"),
      West = c("Alaska","Hawaii","Washington","Oregon","California","Nevada","Idaho","Utah","Arizona","New Mexico","Colorado","Wyoming","Montana")
    )
    regionconvert4 <- function(x, lookup) {
      out <- x[NA]
      for (nm in names(lookup)) {
        ind <- x %in% lookup[[nm]]
        out[ind] <- nm
      }
      return(out)
    }

    The intent of this second version is to replace each value with the name of the list entry (a vector of possible values) that contains it.

  3. Somewhat the reverse of the previous technique is to provide a lookup of sorts. I'll modify the regionlist above so that the names are the states rather than the regions. (This could easily be created in other ways.)

    # regiondf is the state/region frame built in technique 4 below
    statelist <- setNames(names(tibble::deframe(regiondf)), tibble::deframe(regiondf))
    statelist[1:5]
    #       Texas    Oklahoma    Arkansas   Louisiana Mississippi
    #     "South"     "South"     "South"     "South"     "South"
    statelist[ c("Colorado","New Jersey") ]
    #    Colorado  New Jersey
    #      "West" "Northeast"

    This removes the need for a function, a la statelist[AREA].

  4. Merge/join. This is slightly more advanced, but I think it is more maintainable in the long run (for example, you can maintain the state/region list in a simple CSV or spreadsheet, which may make editing, changing, and extending it easier). I'll create the new frame from the regionlist object, but it could just as easily be created directly or in some more familiar way:

    regiondf <- tibble::enframe(regionlist, name="region", value="AREA") %>% tidyr::unnest(AREA)
    regiondf
    # # A tibble: 50 x 2
    #    region AREA
    #    <chr>  <chr>
    #  1 South  Texas
    #  2 South  Oklahoma
    #  3 South  Arkansas
    #  4 South  Louisiana
    #  5 South  Mississippi
    #  6 South  Alabama
    #  7 South  Georgia
    #  8 South  Florida
    #  9 South  Tennessee
    # 10 South  Kentucky
    # # ... with 40 more rows

Now I'll demonstrate all of these with some simple sample data. (Side note: if things do not work for you, it may be because we do not have your sample data and/or whatever nuances only you know about. In the future, please provide some sample data to test with, along with your expected output.)

sampledata <- tibble(AREA = c("Colorado", "California", "New Jersey", "Florida", "Guam"))
sampledata %>%
  mutate(
    r2 = regionconvert2(AREA),
    r3 = regionconvert3(AREA),
    r4 = regionconvert4(AREA, regionlist),
    r5 = statelist[AREA]
  ) %>%
  left_join(regiondf, by = "AREA")
# # A tibble: 5 x 6
#   AREA       r2        r3        r4        r5        region
#   <chr>      <chr>     <chr>     <chr>     <chr>     <chr>
# 1 Colorado   West      West      West      West      West
# 2 California West      West      West      West      West
# 3 New Jersey Northeast Northeast Northeast Northeast Northeast
# 4 Florida    South     South     South     South     South
# 5 Guam       <NA>      <NA>      <NA>      <NA>      <NA>

(If you want to use the fourth "merge/join" technique, you do not need mutate.)
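For reference, the root cause can be reproduced without dplyr. The sketch below is illustrative only: the state lists are deliberately truncated, and the names south, northeast, region_scalar, and region_vec are hypothetical. It shows that %in% returns one logical per element and that if wants a single logical (older R versions warn and use only the first element, while R 4.2.0 and later raise an error). It then shows two ways around that: applying the scalar function element by element with vapply, or filling a pre-built NA vector as in technique 2.

```r
# Truncated lookup sets, for illustration only (not the full Census lists).
south     <- c("Texas", "Florida", "Georgia")
northeast <- c("Maine", "New York", "New Jersey")

x <- c("Florida", "New Jersey", "Guam")

# %in% is vectorised: one TRUE/FALSE per element of x.
x %in% south  # TRUE FALSE FALSE

# if() needs a single TRUE/FALSE. Depending on the R version this is a
# warning (only the first element is used) or an error; capture either.
outcome <- tryCatch(
  if (x %in% south) "South" else "other",
  warning = function(w) "warning",
  error   = function(e) "error"
)

# Option A: keep a scalar function and apply it element by element.
region_scalar <- function(s) {
  if (s %in% south) return("South")
  if (s %in% northeast) return("Northeast")
  NA_character_
}
via_vapply <- vapply(x, region_scalar, character(1), USE.NAMES = FALSE)

# Option B: vectorised fill. x[NA] yields an all-NA vector with the same
# length and type as x, which is then overwritten by logical indexing.
region_vec <- function(x) {
  out <- x[NA]
  out[x %in% south] <- "South"
  out[x %in% northeast] <- "Northeast"
  out
}
via_fill <- region_vec(x)

via_fill  # "South" "Northeast" NA
```

Both options give the same result here. Option B avoids one function call per row, which is presumably why the answer favours vectorised forms.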

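Answer 1: (score: 0)

state.region is a factor vector, not a function. It has 50 elements, ordered alphabetically by state name. To combine it with the dataset from the original post, it can be converted into a tibble together with state.name, as follows.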

library(tidyverse)
stateNames <- tibble(state = as.character(state.name),region = as.character(state.region))
head(stateNames)

...and the first few rows of the output:

> head(stateNames)
# A tibble: 6 x 2
  state      region
  <chr>      <chr> 
1 Alabama    South 
2 Alaska     West  
3 Arizona    West  
4 Arkansas   South 
5 California West  
6 Colorado   West  
>

Now the state information can be merged with the AREA variable, as described in r2evans's answer.
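A minimal base R sketch of that merge, assuming the data has a character AREA column. The names sampledata and state_lookup are made up, and Guam is included only to show what happens to an unmatched value. Note that the built-in state.region labels the Midwest as "North Central", unlike the hand-written lists in the question.

```r
# Lookup table built from R's built-in state.name / state.region vectors.
state_lookup <- data.frame(
  AREA   = state.name,
  Region = as.character(state.region),
  stringsAsFactors = FALSE
)

# Hypothetical sample data; "Guam" is not one of the 50 states.
sampledata <- data.frame(
  AREA = c("Colorado", "New Jersey", "Florida", "Guam"),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps rows of sampledata with no match (Region is NA).
# merge() may reorder rows, so look results up by AREA rather than by position.
merged <- merge(sampledata, state_lookup, by = "AREA", all.x = TRUE)

# The same lookup without reordering, using match():
sampledata$Region <- state_lookup$Region[match(sampledata$AREA, state_lookup$AREA)]
sampledata$Region  # "West" "Northeast" "South" NA
```

With dplyr loaded, left_join(sampledata, state_lookup, by = "AREA") should give the same result while preserving row order.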