Summing duplicate keys when trying to spread() in R

Time: 2018-03-19 01:49:55

Tags: r dplyr tidyr

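I'm trying to learn R, and I decided to do it by building something that interprets the live election results my state publishes on election night. Unfortunately, I've run into trouble calculating the Margin value I use for the map fill. My state (WA) uses a top-two primary, which means that in some races the November election has two candidates from the same party. That may be too much background, but in any case here is the coding problem: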

I have a data frame that looks like this:

Dist    Party                       Votes
1       (Prefers Democratic Party)  124151
1       (Prefers Republican Party)  101428
2       (Prefers Democratic Party)  122173
2       (Prefers Republican Party)  79518
3       (Prefers Republican Party)  124796
3       (Prefers Democratic Party)  78018
4       (Prefers Republican Party)  75307
4       (Prefers Republican Party)  77772
5       (Prefers Republican Party)  135470
5       (Prefers Democratic Party)  87772
6       (Prefers Democratic Party)  141265
6       (Prefers Republican Party)  83025
7       (Prefers Democratic Party)  203954
7       (Prefers Republican Party)  47921
8       (Prefers Republican Party)  125741
8       (Prefers Democratic Party)  73003
9       (Prefers Democratic Party)  118132
9       (Prefers Republican Party)  48662
10      (Prefers Democratic Party)  99279
10      (Prefers Republican Party)  82213

I want it to look like this:

Dist    (Prefers Democratic Party)  (Prefers Republican Party)
1       124151                      101428
2       122173                      79518
3       78018                       124796
4       [NA or 0]                   153079
5       87772                       135470
6       141265                      83025
7       203954                      47921
8       73003                       125741
9       118132                      48662
10      99279                       82213
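spread() fails because of the duplicate rows for Dist = 4. I've managed to hack around the problem as shown here, but I'm not happy with it, and I'm almost positive there is a better way: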

library(tidyr)
library(dplyr)

CongressTidy %>%
  group_by(Dist) %>%
  mutate(GOPVotes = sum(ifelse(Party == "(Prefers Republican Party)", Votes, 0))) %>%
  mutate(DemVotes = sum(ifelse(Party == "(Prefers Democratic Party)", Votes, 0)))

This returns:

Dist    Party                       Votes   GOPVotes    DemVotes
<fctr>  <fctr>                      <int>   <dbl>       <dbl>
1       (Prefers Democratic Party)  124151  101428      124151
1       (Prefers Republican Party)  101428  101428      124151
2       (Prefers Democratic Party)  122173  79518       122173
2       (Prefers Republican Party)  79518   79518       122173
3       (Prefers Republican Party)  124796  124796      78018
3       (Prefers Democratic Party)  78018   124796      78018
4       (Prefers Republican Party)  75307   153079      0
4       (Prefers Republican Party)  77772   153079      0
5       (Prefers Republican Party)  135470  135470      87772
5       (Prefers Democratic Party)  87772   135470      87772
6       (Prefers Democratic Party)  141265  83025       141265
6       (Prefers Republican Party)  83025   83025       141265
7       (Prefers Democratic Party)  203954  47921       203954
7       (Prefers Republican Party)  47921   47921       203954
8       (Prefers Republican Party)  125741  125741      73003
8       (Prefers Democratic Party)  73003   125741      73003
9       (Prefers Democratic Party)  118132  48662       118132
9       (Prefers Republican Party)  48662   48662       118132
10      (Prefers Democratic Party)  99279   82213       99279
10      (Prefers Republican Party)  82213   82213       99279

That's fine as far as it goes, and I can then add a selector column and subset on it:

CongressMargins <- CongressTidy  %>%
  group_by(Dist) %>%
  mutate(GOPVotes = sum(ifelse(Party == "(Prefers Republican Party)", Votes, 0))) %>%
  mutate(DemVotes = sum(ifelse(Party == "(Prefers Democratic Party)", Votes, 0))) %>%
  mutate(selector = c(1,2)) %>%
  subset(selector == 1, select = c(Dist, GOPVotes, DemVotes))

This gives me what I want, and I can calculate the margin from there:

Dist    GOPVotes    DemVotes
<fctr>  <dbl>       <dbl>
1       101428      124151      
2       79518       122173      
3       124796      78018       
4       153079      0       
5       135470      87772       
6       83025       141265      
7       47921       203954      
8       125741      73003       
9       48662       118132      
10      82213       99279   

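But this would get messed up if there were 2 uncontested races, because it is based on vector recycling. It's just ugly, and there must be a better way. Any ideas?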

2 Answers:

Answer 0: (score: 3)

We can compute the sum of votes for each Dist and Party group first, and then spread. If you want the missing cells to be 0, use spread(Party, Votes, fill = 0).

library(tidyverse)

dat2 <- dat %>%
  group_by(Dist, Party) %>%
  summarise(Votes = sum(Votes)) %>%
  spread(Party, Votes) %>%
  ungroup()
dat2
# # A tibble: 10 x 3
#     Dist `(Prefers Democratic Party)` `(Prefers Republican Party)`
#    <int>                        <int>                        <int>
#  1     1                       124151                       101428
#  2     2                       122173                        79518
#  3     3                        78018                       124796
#  4     4                           NA                       153079
#  5     5                        87772                       135470
#  6     6                       141265                        83025
#  7     7                       203954                        47921
#  8     8                        73003                       125741
#  9     9                       118132                        48662
# 10    10                        99279                        82213
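
As a follow-up to the summarise-then-spread approach, here is a minimal sketch of how the margin itself might be computed once the table is wide. It uses a two-district stand-in for the question's data. The column names DemVotes, GOPVotes and Margin, and the margin formula (difference over the two-party total), are assumptions for illustration, not something the question specifies.

```r
library(dplyr)
library(tidyr)

# Small stand-in for the question's data (districts 3 and 4 only)
dat <- data.frame(
  Dist  = c(3, 3, 4, 4),
  Party = c("(Prefers Republican Party)", "(Prefers Democratic Party)",
            "(Prefers Republican Party)", "(Prefers Republican Party)"),
  Votes = c(124796, 78018, 75307, 77772),
  stringsAsFactors = FALSE
)

wide <- dat %>%
  group_by(Dist, Party) %>%
  summarise(Votes = sum(Votes)) %>%      # collapse same-party candidates
  spread(Party, Votes, fill = 0) %>%     # a party with no candidate gets 0
  ungroup() %>%
  # Hypothetical shorter names, chosen here for readability
  rename(DemVotes = `(Prefers Democratic Party)`,
         GOPVotes = `(Prefers Republican Party)`) %>%
  # Assumed definition of margin: positive means a Democratic lead
  mutate(Margin = (DemVotes - GOPVotes) / (DemVotes + GOPVotes))

wide
```

Because the votes are summed per group before spreading, this does not depend on every district having exactly two rows, which was the weakness of the selector-column approach in the question.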

Data

dat <- read.table(text = "Dist    Party                       Votes
1       '(Prefers Democratic Party)'  124151
                  1       '(Prefers Republican Party)'  101428
                  2       '(Prefers Democratic Party)'  122173
                  2       '(Prefers Republican Party)'  79518
                  3       '(Prefers Republican Party)'  124796
                  3       '(Prefers Democratic Party)'  78018
                  4       '(Prefers Republican Party)'  75307
                  4       '(Prefers Republican Party)'  77772
                  5       '(Prefers Republican Party)'  135470
                  5       '(Prefers Democratic Party)'  87772
                  6       '(Prefers Democratic Party)'  141265
                  6       '(Prefers Republican Party)'  83025
                  7       '(Prefers Democratic Party)'  203954
                  7       '(Prefers Republican Party)'  47921
                  8       '(Prefers Republican Party)'  125741
                  8       '(Prefers Democratic Party)'  73003
                  9       '(Prefers Democratic Party)'  118132
                  9       '(Prefers Republican Party)'  48662
                  10      '(Prefers Democratic Party)'  99279
                  10      '(Prefers Republican Party)'  82213",
                  header = TRUE, stringsAsFactors = FALSE)
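
In newer versions of tidyr, spread() has been superseded by pivot_wider(), which can aggregate duplicates itself. The following is a sketch, assuming tidyr 1.1.0 or later (where values_fn accepts a plain function and values_fill accepts a scalar). It again uses a two-district stand-in for the data above.

```r
library(tidyr)

# Small stand-in for the question's data (districts 3 and 4 only)
dat <- data.frame(
  Dist  = c(3, 3, 4, 4),
  Party = c("(Prefers Republican Party)", "(Prefers Democratic Party)",
            "(Prefers Republican Party)", "(Prefers Republican Party)"),
  Votes = c(124796, 78018, 75307, 77772),
  stringsAsFactors = FALSE
)

wide2 <- pivot_wider(
  dat,
  names_from  = Party,
  values_from = Votes,
  values_fn   = sum,  # sum rows that share the same Dist and Party
  values_fill = 0     # use 0 where a party has no candidate
)

wide2
```

This folds the group_by() and summarise() step into the reshape, so no separate aggregation is needed beforehand.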

Answer 1: (score: 1)

You can use dcast from the reshape2 package, specifying sum as the aggregation function:

 library(reshape2)
 dcast(dat,Dist~Party,sum,value.var = "Votes")


   Dist (Prefers Democratic Party) (Prefers Republican Party)
1     1                     124151                     101428
2     2                     122173                      79518
3     3                      78018                     124796
4     4                          0                     153079
5     5                      87772                     135470
6     6                     141265                      83025
7     7                     203954                      47921
8     8                      73003                     125741
9     9                     118132                      48662
10   10                      99279                      82213

Using base R:

xtabs(Votes~Dist+Party,dat)
    Party
Dist (Prefers Democratic Party) (Prefers Republican Party)
  1                      124151                     101428
  2                      122173                      79518
  3                       78018                     124796
  4                           0                     153079
  5                       87772                     135470
  6                      141265                      83025
  7                      203954                      47921
  8                       73003                     125741
  9                      118132                      48662
  10                      99279                      82213

The output above is of class table. You can turn it into a data frame like this:

as.data.frame.matrix(xtabs(Votes~Dist+Party,dat))

This is now a data frame, and you can group it however you like.
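
One thing to be aware of with the xtabs route is that Dist ends up in the row names of the converted data frame rather than in a column. Here is a small base R sketch of moving it back into a regular column, using a two-district stand-in for the data (the object name wide3 is just for illustration):

```r
# Small stand-in for the question's data (districts 3 and 4 only)
dat <- data.frame(
  Dist  = c(3, 3, 4, 4),
  Party = c("(Prefers Republican Party)", "(Prefers Democratic Party)",
            "(Prefers Republican Party)", "(Prefers Republican Party)"),
  Votes = c(124796, 78018, 75307, 77772),
  stringsAsFactors = FALSE
)

# Cross-tabulate (summing Votes) and convert the table to a data frame
wide3 <- as.data.frame.matrix(xtabs(Votes ~ Dist + Party, dat))

# Dist is stored in the row names; copy it into a column (as character)
wide3$Dist <- rownames(wide3)
rownames(wide3) <- NULL

wide3
```

Note that Dist comes back as a character vector here, so it may need as.integer() before joining against other data.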