R (tidyverse) - Column totals for aggregated counts of 2 categorical variables, for a chi-square test of independence?

Asked: 2018-01-08 16:41:56

Tags: r dplyr

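Could someone please give me some advice?

  1. I would like to add column totals to my summary table.
  2. I need the table for a chi-square test of independence, so if there is a quicker way, please let me know!
  3. What is the best way to do this?

I tried using colSums, but it gave me an error (Error in colSums(., mpaa_rating, na.rm = FALSE, dims = 1) : unused argument (mpaa_rating)). I am obviously not using it correctly or not putting it in the right place. I tried: colSums(mpaa_rating, na.rm = FALSE, dims = 1) %>% just above spread.

Many thanks, Christine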

    library(tidyverse)  # provides tribble(), %>%, filter(), count(), spread()
    reprex::reprex_info()
    movie_help<- data.frame(tribble(
                 ~mpaa_rating,                       ~genre,
                         "PG",         "Action & Adventure",
                          "R",         "Mystery & Suspense",
                          "R",                      "Drama",
                          "R",                      "Drama",
                          "R",                      "Drama",
                         "PG",         "Action & Adventure",
                      "PG-13",                     "Comedy",
                          "R",                     "Comedy",
                          "R",         "Action & Adventure",
                          "R",                      "Drama",
                          "R",                      "Drama",
                          "G",                      "Drama",
                          "R",                     "Comedy",
                          "R",                      "Drama",
                          "R",         "Mystery & Suspense",
                          "R",  "Musical & Performing Arts",
                    "Unrated",                      "Drama",
                          "R",                      "Drama",
                      "PG-13",                      "Drama",
                      "PG-13",                      "Drama"
                 )) 
    movie_help %>% 
    filter(!is.na(genre), !is.na(mpaa_rating)) %>% 
    count(genre, mpaa_rating) %>%
    group_by(genre) %>%
    mutate(prop = n) %>%
    mutate(Total= sum(n)) %>%
    select(-n) %>%
    spread(key = mpaa_rating, value = prop) 
    #> # A tibble: 5 x 7
    #> # Groups:   genre [5]
    #>                       genre Total     G    PG `PG-13`     R Unrated
    #> *                     <chr> <int> <int> <int>   <int> <int>   <int>
    #> 1        Action & Adventure     3    NA     2      NA     1      NA
    #> 2                    Comedy     3    NA    NA       1     2      NA
    #> 3                     Drama    11     1    NA       2     7       1
    #> 4 Musical & Performing Arts     1    NA    NA      NA     1      NA
    #> 5        Mystery & Suspense     2    NA    NA      NA     2      NA
    

2 Answers:

Answer 0 (score: 1)

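To get totals at the bottom, I like to use the janitor::adorn_totals function from the janitor package. The janitor package has many small helper functions for situations where you want to clean up a table in a particular way. See the janitor package documentation for details. Another favourite of mine is janitor::clean_names, which helps you clean up column names consistently.

Now you can simply write: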

    movie_help %>%
        filter(!is.na(genre), !is.na(mpaa_rating)) %>%
        count(genre, mpaa_rating) %>%
        group_by(genre) %>%
        mutate(prop = n) %>%
        mutate(Total = sum(n)) %>%
        select(-n) %>%
        # fill = 0 replaces the NA cells with zero counts
        spread(key = mpaa_rating, value = prop, fill = 0) %>%
        # drop the grouping before appending the totals row
        ungroup() %>%
        janitor::adorn_totals('row') %>%
        janitor::clean_names()

Answer 1 (score: 0)

We can use table and chisq.test to run the test you want:

chisq.test(table(movie_help))
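
To make the one-liner above less of a black box, here is a minimal base-R sketch. It rebuilds the question's data with data.frame (an assumption made so the block runs without tidyverse), cross-tabulates it, and looks at the expected counts that chisq.test computes. The names tab and res are only illustrative.

```r
# Data copied from the question (20 films); rebuilt with base R so the
# sketch is self-contained.
movie_help <- data.frame(
  mpaa_rating = c("PG", "R", "R", "R", "R", "PG", "PG-13", "R", "R", "R",
                  "R", "G", "R", "R", "R", "R", "Unrated", "R", "PG-13", "PG-13"),
  genre = c("Action & Adventure", "Mystery & Suspense", "Drama", "Drama", "Drama",
            "Action & Adventure", "Comedy", "Comedy", "Action & Adventure", "Drama",
            "Drama", "Drama", "Comedy", "Drama", "Mystery & Suspense",
            "Musical & Performing Arts", "Drama", "Drama", "Drama", "Drama"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two categorical variables (genre in rows, rating in columns).
tab <- table(genre = movie_help$genre, mpaa_rating = movie_help$mpaa_rating)

# Chi-square test of independence. With counts this small, R warns that the
# chi-squared approximation may be incorrect.
res <- chisq.test(tab)

# Expected counts under independence: row total * column total / grand total.
print(round(res$expected, 2))

# One possible workaround for sparse tables: a Monte Carlo p-value.
set.seed(1)
res_sim <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
print(res_sim$p.value)
```

Note that chisq.test works directly on the raw counts, so the totals row and column are not needed for the test itself; they are only useful for presenting the table.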

We can also compute the totals manually:

dat <- movie_help %>%
  filter(!is.na(genre),!is.na(mpaa_rating)) %>%
  count(genre, mpaa_rating) %>%
  group_by(genre) %>%
  mutate(prop = n) %>%
  mutate(Total = sum(n)) %>%
  select(-n) %>%
  spread(key = mpaa_rating, value = prop) 

# Build a one-row "Total" frame from the column sums and append it to dat
bind_rows(dat,
          cbind(tibble(genre = 'Total'), summarise_all(dat[,-1], sum, na.rm = TRUE)))

  genre                     Total     G    PG `PG-13`     R Unrated
  <chr>                     <int> <int> <int>   <int> <int>   <int>
1 Action & Adventure            3    NA     2      NA     1      NA
2 Comedy                        3    NA    NA       1     2      NA
3 Drama                        11     1    NA       2     7       1
4 Musical & Performing Arts     1    NA    NA      NA     1      NA
5 Mystery & Suspense            2    NA    NA      NA     2      NA
6 Total                        20     1     2       3    13       1
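
On the colSums error from the question: inside a pipe, the data is already passed as the first argument, so the extra mpaa_rating has no parameter to match and R reports it as unused. colSums expects a numeric matrix or data frame, not a column name. A rough base-R sketch of the same totals, assuming the question's data rebuilt with data.frame (the name m is only illustrative):

```r
# Data copied from the question (20 films); rebuilt with base R so the
# sketch is self-contained.
movie_help <- data.frame(
  mpaa_rating = c("PG", "R", "R", "R", "R", "PG", "PG-13", "R", "R", "R",
                  "R", "G", "R", "R", "R", "R", "Unrated", "R", "PG-13", "PG-13"),
  genre = c("Action & Adventure", "Mystery & Suspense", "Drama", "Drama", "Drama",
            "Action & Adventure", "Comedy", "Comedy", "Action & Adventure", "Drama",
            "Drama", "Drama", "Comedy", "Drama", "Mystery & Suspense",
            "Musical & Performing Arts", "Drama", "Drama", "Drama", "Drama"),
  stringsAsFactors = FALSE
)

# Counts as a plain matrix: genres in rows, ratings in columns.
# Empty combinations come out as 0 rather than NA.
m <- unclass(table(movie_help$genre, movie_help$mpaa_rating))

# Append a Total column (row sums), then a Total row (column sums).
m <- cbind(m, Total = rowSums(m))
m <- rbind(m, Total = colSums(m))

print(m)
```

Because the whole object is numeric here, colSums and rowSums apply cleanly; in the tidyverse pipeline the genre column is character, which is why it has to be dropped (dat[,-1]) before summing.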