How do I get the frequency of a combination of two categorical variables in a dataset?

Asked: 2019-04-28 22:43:08

Tags: r dplyr

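I am interested in finding the frequency of people in my dataset who are unemployed and who are also African American/Black. I have a large dataset that includes the variables OCC (unemployed people are coded as 0) and RACE (AA/Black is coded as 2).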

I tried using the group_by() function from the tidyverse, but I think I may be doing it wrong, because I get the error message shown below.

Here is the code:

RACE <- group_by(cps_data, OCC, RACE)
occupation <- summarise(RACE,
                   count = n(),
                   OCC = mean(OCC, na.rm = TRUE)
)


summarise(RACE, occupation = mean(OCC, na.rm = TRUE))

Creating the occupation object gives me this error message:

Error in summarise_impl(.data, dots) : 
  Column `OCC` can't be modified because it's a grouping variable

The standalone summarise() call runs, but its output only gives me a slight hint:

# A tibble: 1,374 x 3
# Groups:   OCC [?]
     OCC  RACE occupation
   <int> <int>      <dbl>
 1     0     1          0
 2     0     2          0
 3     0     3          0
 4     0     4          0
 5     0     5          0
 6     0     6          0
 7     0     7          0
 8     0     8          0
 9     0     9          0
10    10     1         10

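Below is a sample of my data, which I have tried to reproduce to make it easier to help. As you can see above, I made another object that uses only OCC and RACE, because those are the only variables that matter right now.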

dput(head(cps_data,4))
structure(list(YEAR = c(2015L, 2015L, 2015L, 2015L), DATANUM = c(1L, 
1L, 1L, 1L), SERIAL = c(1029644L, 1029644L, 1029705L, 1029708L
), CBSERIAL = c(403, 403, 1944, 1964), HHWT = c(194L, 194L, 142L, 
77L), STATEICP = c(14L, 14L, 14L, 14L), STATEFIP = c(42L, 42L, 
42L, 42L), CITY = c(5330L, 5330L, 5330L, 5330L), GQ = c(1L, 1L, 
1L, 1L), PERNUM = c(1L, 3L, 1L, 1L), PERWT = c(194L, 140L, 142L, 
78L), SEX = c(2L, 1L, 2L, 1L), AGE = c(37L, 35L, 60L, 41L), RACE = c(1L, 
1L, 2L, 2L), RACED = c(100L, 100L, 200L, 200L), OCC = c(800L, 
6260L, 0L, 350L), IND = c(7270L, 770L, 0L, 8190L), INCWAGE = c(75000L, 
25000L, 0L, 83000L)), row.names = c(NA, 4L), class = "data.frame")

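I would like an output that shows the number of unemployed people in my data who also identify as African American/Black, so that I can make comparisons within my dataset.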

1 Answer:

Answer 0 (score: 0)

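If I understand you correctly, you are almost there. The error occurs because OCC = mean(OCC, na.rm = TRUE) tries to overwrite OCC, which is a grouping variable. Counting the rows in each OCC/RACE group with n() is all you need: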

df %>%
    group_by(OCC, RACE) %>%
    summarize(count = n())

# A tibble: 4 x 3
# Groups:   OCC [4]
    OCC  RACE count
  <int> <int> <int>
1     0     2     1
2   350     2     1
3   800     1     1
4  6260     1     1
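
If only the single cell for unemployed African American/Black respondents is needed, one option is to filter first and then count. The following is a minimal sketch, assuming the coding described in the question (OCC == 0 for unemployed, RACE == 2 for AA/Black). The small data frame keeps only the OCC and RACE columns from the question's dput() output, and the name unemployed_black is chosen here purely for illustration.

```r
library(dplyr)

# Two relevant columns copied from the question's dput() sample
df <- data.frame(
  RACE = c(1L, 1L, 2L, 2L),
  OCC  = c(800L, 6260L, 0L, 350L)
)

# Keep rows coded as unemployed (OCC == 0) and AA/Black (RACE == 2),
# then count how many rows remain
unemployed_black <- df %>%
  filter(OCC == 0, RACE == 2) %>%
  summarise(count = n())

unemployed_black$count
```

On the four sample rows this leaves exactly one matching row. For the full table of every OCC/RACE combination, count(df, OCC, RACE) should give the same counts as the group_by() plus summarize() version, with the count column named n.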

Data

library(tidyverse)
df <- structure(list(YEAR = c(2015L, 2015L, 2015L, 2015L), DATANUM = c(1L, 
    1L, 1L, 1L), SERIAL = c(1029644L, 1029644L, 1029705L, 1029708L
    ), CBSERIAL = c(403, 403, 1944, 1964), HHWT = c(194L, 194L, 142L, 
    77L), STATEICP = c(14L, 14L, 14L, 14L), STATEFIP = c(42L, 42L, 
    42L, 42L), CITY = c(5330L, 5330L, 5330L, 5330L), GQ = c(1L, 1L, 
    1L, 1L), PERNUM = c(1L, 3L, 1L, 1L), PERWT = c(194L, 140L, 142L, 
    78L), SEX = c(2L, 1L, 2L, 1L), AGE = c(37L, 35L, 60L, 41L), RACE = c(1L, 
    1L, 2L, 2L), RACED = c(100L, 100L, 200L, 200L), OCC = c(800L, 
    6260L, 0L, 350L), IND = c(7270L, 770L, 0L, 8190L), INCWAGE = c(75000L, 
    25000L, 0L, 83000L)), row.names = c(NA, 4L), class = "data.frame")