重新编码单个变量分布在R中的几个

时间:2014-10-15 03:04:59

标签: r survey qualtrics recode

我正在处理有关种族问题的调查数据。每个种族类别都是自己的变量。这就是我想要做的事情:

  1. 创建一个新变量p.race
  2. 为种族/族裔的八个变量中的一个分配p.race(下方)。
  3. 在这种情况下,确定某个人是否标记了两场或更多场比赛并指定p.race值“两场或多场比赛”。
  4. 当他们表示这种族时,指定p.race“西班牙裔或拉丁裔”的价值。
  5. 创建一个新变量p.poc,以表明他们是否是有色人物(即白色,包括西班牙裔/拉丁裔)。这应该是0或1。
  6. 八种种族分别是白色*,黑色*,亚洲*,AIAN *,NHPI *,其他一些种族*,两个或更多种族*和西班牙裔;其中*表示西班牙裔或拉丁裔种族。


    这是我到目前为止解析“两场或更多场比赛”的原因:

    p['p.race'] <- NA # create new variable for race
    
    # list of variable names that store a string indicating the race
    ## e.g., `race_white` would be either blank or contain "White, European, Middle Eastern, or Caucasian"
    race.list <- c('p.race_white', 'p.race_black', 'p.race_asian', 'p.race_aian', 'p.race_nhpi', 'p.race_other')
    
    # iterate through each record
    for ( n in 1:length(p) ) {
      multiflag = 0
    
      # iterate through the race list
      for ( i in race.list ) {
    
        # if it is not blank, +1 to multiflag
        if ( p$i[n] != '' ) {
          multiflag <- multiflag + 1
        }
      }
    
      # if multiflag was flagged more than once, assign "Two or more races" to `race`
      if ( multiflag > 1 ) {
        p$p.race[n] <- 'Two or more races'
      }
    }
    

    执行时,会返回以下错误:

    > Error in if (p$i[n] != "") { : argument is of length zero
    

    这是我的poc变量编码,错误如下:

    p['p.poc'] <- 0 # create a new variable for whether they are a person of color
    for ( n in 1:length(p) ) {
      if ( p$p.race_black[n] == 'Black, African-American, or African'
           | p$p.race_asian[n] == 'Asian or Asian-American'
           | p$p.race_aian[n] == 'American Indian or Alaskan Native'
           | p$p.race_nhpi[n] == 'Native Hawaiian or other Pacific Islander'
           | p$p.race_other[n] == 'Other (please specify)'
           | p$p.hispanic[n] == 'Yes') {
        p$p.poc[n] <- 1
      }
    }
    
    > Error in if (p$p.race_black[n] == "Black, African-American, or African" |  : 
      missing value where TRUE/FALSE needed
    

    我真的不知道在哪里开始为新的race变量分配八个种族类别中的一个,而不是使用非常长代码。


    如果有帮助,以下是调查问题:

    Q1。你认为自己是西班牙裔,拉丁裔或西班牙裔吗?

    • 没有

    Q2。你确定哪个种族(检查所有适用的)?

    • 白人,欧洲人,中东人或白种人
    • 黑人,非裔美国人或非洲人
    • 亚洲或亚裔美国人
    • 美洲印第安人或阿拉斯加原住民
    • 夏威夷原住民或其他太平洋岛民
    • 其他(请注明)

    以下是示例输出(文本截断):

    > p[264:271]
    #    
    #      p.hispanic  p.race_white p.race_black p.race_asian p.race_aian p.race_nhpi p.race_other
    #   1  Yes         White
    #   2  No          White
    #   3  No                       Black
    #   4  No          White                     Asian
    #   5  Yes                                                                        Some other race
    

    这是一个dput输出:

    > dput(p[264:270])
    structure(list(p.hispanic = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 
    2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "No", "Yes"
    ), class = "factor"), p.race_white = structure(c(2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 
    1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 
    2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("", 
    "White, European, Middle Eastern, or Caucasian"), class = "factor"), 
        p.race_black = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
        1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
        "Black, African-American, or African"), class = "factor"), 
        p.race_asian = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", 
        "Asian or Asian-American"), class = "factor"), p.race_aian = structure(c(1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
        1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L), .Label = c("", "American Indian or Alaskan Native"
        ), class = "factor"), p.race_nhpi = c(NA, NA, NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
        NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
        p.race_other = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
        1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
        1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
        "Other (please specify)"), class = "factor")), .Names = c("p.hispanic", 
    "p.race_white", "p.race_black", "p.race_asian", "p.race_aian", 
    "p.race_nhpi", "p.race_other"), class = "data.frame", row.names = c(NA, 
    -79L))
    

2 个答案:

答案 0 :(得分:2)

这不是很优雅,但我认为它有效。使用循环,特别是嵌套循环,并不是很好的&#34; R&#34;因为它们很慢,但也有副作用,如混乱你的工作区。

如果没有指定种族,你可能想要改变对待p.poc的方式,因为它默认为1,这可能不是你想要的。

所以这是一种方式:

tmp <- lapply(1:nrow(p), function(ii) {
  ## this checks for columns that aren't blank or NA, takes the colname
  ## and strips off the prefix
  tmp <- gsub('p.race_', '', names(p)[which(p[ii, -1] != '' & !is.na(p[ii, -1])) + 1])

  ## some special cases for > 1 race and blanks and p.poc
  tmp <- ifelse(length(tmp) > 1, 'Two or more', tmp)
  tmp[is.na(tmp)] <- 'Not specified'
  tmp <- ifelse(p[ii, 1] %in% 'Yes', 'Hispanic or Latino', tmp)
  p.poc <- (!grepl('white', tmp)) * 1

  return(list(p.race = tmp, p.poc = p.poc))
})

head(do.call(rbind, tmp), 20)

#   p.race               p.poc
# [1,] "white"               0    
# [2,] "white"               0    
# [3,] "white"               0    
# [4,] "white"               0    
# [5,] "white"               0    
# [6,] "white"               0    
# [7,] "white"               0    
# [8,] "white"               0    
# [9,] "asian"               1    
# [10,] "white"              0    
# [11,] "other"              1    
# [12,] "white"              0    
# [13,] "white"              0    
# [14,] "white"              0    
# [15,] "Hispanic or Latino" 1    
# [16,] "white"              0    
# [17,] "white"              0    
# [18,] "white"              0    
# [19,] "white"              0    
# [20,] "white"              0   

## and combine back to the data frame
p <- cbind(p, do.call(rbind, tmp))

数据:

p <- structure(list(p.hispanic = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "No", "Yes"
), class = "factor"), p.race_white = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L), .Label = c("", 
"White, European, Middle Eastern, or Caucasian"), class = "factor"), 
p.race_black = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
"Black, African-American, or African"), class = "factor"), 
p.race_asian = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L), .Label = c("", 
"Asian or Asian-American"), class = "factor"), p.race_aian = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("", "American Indian or Alaskan Native"
), class = "factor"), p.race_nhpi = c(NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), 
p.race_other = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", 
"Other (please specify)"), class = "factor")), .Names = c("p.hispanic", 
"p.race_white", "p.race_black", "p.race_asian", "p.race_aian", 
"p.race_nhpi", "p.race_other"), class = "data.frame", row.names = c(NA, 
  -79L))

答案 1 :(得分:1)

我的工作方式,如果数据是长格式而不是宽格式,这种任务总是更容易。但是,这意味着每个响应需要一个唯一的ID - 在这种情况下,您可以为每一行分配一个整数。

library(tidyr)
library(dplyr)

# Add individual ID to each row
p = mutate(p, id = 1:n())

完成后,我会做一些工作,使p.hispanic列看起来更像其他种族列,将数据集放在长格式中,删除所有NA /空白,然后制作两个新变量。创建新变量后,可以将它们连接到原始变量。我使用包 tidyr 进行重新整形,使用 dplyr 进行操作。

p %>%
    mutate(p.hispanic = ifelse(p.hispanic == "No", NA, "Hispanic or Latino")) %>% # change p.hispanic column
    gather(category, answer, p.hispanic:p.race_other, na.rm = TRUE) %>%
    filter(answer != "") %>% # get rid of blanks (if were NA would have removed in "gather")
    group_by(id) %>%
    # Create new variable p.race and p.pop based on rules
    mutate(p.race = ifelse(n_distinct(answer) > 1, "Two or more races", answer),
          p.poc = as.integer(p.race == "White, European, Middle Eastern, or Caucasian")) %>%
    slice(1) %>% # take only 1 record for the duplicate id's
    select(-category, - answer) %>% # remove columns that aren't needed
    left_join(p, ., by = "id") %>% # join new columns with original dataset
    select(-id) # remove ID column if not wanted

获得此数据集之后,如果您希望级别以某种方式显示,则可以使用p.race重置factor的级别。