Summarizing a data frame across rows

Time: 2018-09-16 10:54:45

Tags: r dataframe dplyr

I have a data frame df like the one below:

text <- "
State,District,County,Num Voters,Total Votes in State,Votes for None,Candidate Name,Party,Votes Scored
CA,San Diego,Delmar,190962,48026634,2511,A1,IND,949
CA,San Diego,Delmar,190962,48026634,2511,A2,RP(K),44815
CA,San Diego,Delmar,190962,48026634,2511,A3,IND,1036
CA,San Diego,Delmar,190962,48026634,2511,A4,DEM,29235
CA,San Diego,Delmar,190962,48026634,2511,A5,IND,5064
CA,San Diego,Delmar,190962,48026634,2511,A6,IND,803
CA,San Diego,Delmar,190962,48026634,2511,A7,REP,22329
CA,San Diego,Delmar,190962,48026634,2511,A8,BSP,43553
CA,San Diego,La Jolla,190257,48026634,3629,A1,IND,972
CA,San Diego,La Jolla,190257,48026634,3629,A2,RP(K),66168
CA,San Diego,La Jolla,190257,48026634,3629,A3,IND,2763
CA,San Diego,La Jolla,190257,48026634,3629,A4,DEM,32792
CA,San Diego,La Jolla,190257,48026634,3629,A5,IND,8629
CA,San Diego,La Jolla,190257,48026634,3629,A6,IND,1191
CA,San Diego,La Jolla,190257,48026634,3629,A7,REP,28002
CA,San Diego,La Jolla,190257,48026634,3629,A8,BSP,2555
"
df <- read.table(textConnection(text), sep = ",", header = TRUE)

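My data contains five parties: IND, RP(K), DEM, REP, and BSP. I want to create two new score columns:

  • DRP: DEM score + RP(K) score
  • RSP: REP score + BSP score

In addition, I want to add columns that total these scores at the District and County levels.

What is the best way to do this with dplyr? I was thinking of the grouping functions, but I have not been able to work out the logic.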

2 Answers:

Answer 0: (score: 1)

With dplyr, you can do the following.

library(dplyr)

tg <- df %>%
  # County-level totals, repeated on every row of that county
  group_by(County) %>%
  mutate(DRP_county = sum(Votes.Scored[Party == "RP(K)" | Party == "DEM"]),
         RSP_county = sum(Votes.Scored[Party == "REP" | Party == "BSP"])) %>%
  ungroup() %>%
  # District-level totals, repeated on every row of that district
  group_by(District) %>%
  mutate(DRP_district = sum(Votes.Scored[Party == "RP(K)" | Party == "DEM"]),
         RSP_district = sum(Votes.Scored[Party == "REP" | Party == "BSP"]))

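Note: I think it is better to keep everything in the same data frame, although that of course depends on the size of the data. Likewise, for later analysis of the data frame and for modeling or visualization, mutate is a better fit than summarise, even though summarise gives cleaner output (one row per group instead of one row per candidate).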
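For comparison, here is a minimal sketch of the summarise variant mentioned in the note. It is not part of the original answer; the `votes` data frame is a hand-built subset of the question's data (only the DEM, RP(K), REP and BSP rows), and the names `votes` and `county_totals` are made up for illustration.

```r
library(dplyr)

# Illustrative subset of the question's data: the four parties that
# feed into DRP and RSP, for both counties.
votes <- data.frame(
  District = "San Diego",
  County = rep(c("Delmar", "La Jolla"), each = 4),
  Party = rep(c("RP(K)", "DEM", "REP", "BSP"), times = 2),
  Votes.Scored = c(44815, 29235, 22329, 43553, 66168, 32792, 28002, 2555),
  stringsAsFactors = FALSE
)

# summarise() collapses each District/County group to a single row,
# whereas mutate() would keep one row per candidate.
county_totals <- votes %>%
  group_by(District, County) %>%
  summarise(DRP = sum(Votes.Scored[Party %in% c("DEM", "RP(K)")]),
            RSP = sum(Votes.Scored[Party %in% c("REP", "BSP")])) %>%
  ungroup()

print(county_totals)
```

This yields one row per county (Delmar and La Jolla), which is easier to read but drops the per-candidate rows, so it would have to be joined back to df if those rows are needed later.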

Also, you could skip ungroup(), but I believe it is safer to include it.
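A small sketch of why ungroup() can be skipped here: by default, a second group_by() replaces the existing grouping rather than adding to it. The `toy` data frame and the variable names below are invented for illustration only.

```r
library(dplyr)

# Hypothetical two-row data frame, just enough to inspect grouping.
toy <- data.frame(
  District = c("San Diego", "San Diego"),
  County = c("Delmar", "La Jolla"),
  Votes.Scored = c(74050, 98960),
  stringsAsFactors = FALSE
)

# No ungroup() between the two group_by() calls.
regrouped <- toy %>%
  group_by(County) %>%
  group_by(District)

# Explicit ungroup(), as in the answer.
explicit <- toy %>%
  group_by(County) %>%
  ungroup() %>%
  group_by(District)

# Both report the same grouping variables.
print(group_vars(regrouped))
print(group_vars(explicit))
```

Both pipelines end up grouped by District only, so the explicit ungroup() is mainly a readability and safety habit rather than a requirement.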

Answer 1: (score: 1)

With dplyr, if you only need two columns holding the district-level and county-level sums for the combined parties:

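library(dplyr)

df %>%
  # Recode DEM and RP(K) as "DRP", REP and BSP as "RSP"; keep other parties as-is
  mutate(Party2 = ifelse(Party == "DEM" | Party == "RP(K)", "DRP",
                         ifelse(Party == "REP" | Party == "BSP", "RSP", as.character(Party)))) %>%
  # Total per district and combined party
  group_by(District, Party2) %>%
  mutate(Votes.Scored.District = sum(Votes.Scored)) %>%
  ungroup() %>%
  # Total per county and combined party
  group_by(County, Party2) %>%
  mutate(Votes.Scored.County = sum(Votes.Scored))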

Or, if you want overall statistics for the combined parties at the district and county levels:

df %>%
  # Recode DEM and RP(K) as "DRP", REP and BSP as "RSP"; keep other parties as-is
  mutate(Party2 = ifelse(Party == "DEM" | Party == "RP(K)", "DRP",
                         ifelse(Party == "REP" | Party == "BSP", "RSP", as.character(Party)))) %>%
  group_by(District, Party2) %>%
  mutate(Votes.Scored.District = sum(Votes.Scored)) %>%
  ungroup() %>%
  group_by(County, Party2) %>%
  mutate(Votes.Scored.County = sum(Votes.Scored)) %>%
  # Collapse to one row per combined party; min() keeps the smallest
  # district total and the smallest county total for that party
  group_by(Party2) %>%
  summarise(Votes.Scored.District = min(Votes.Scored.District),
            Votes.Scored.County = min(Votes.Scored.County))

# A tibble: 3 x 3
  Party2 Votes.Scored.District Votes.Scored.County
  <chr>                  <dbl>               <dbl>
1 DRP                  173010.              74050.
2 IND                   21407.               7852.
3 RSP                   96439.              30557.
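The nested ifelse() recoding used in both pipelines above could also be written with dplyr's case_when(), which some readers find easier to extend. This is only a sketch of an alternative, not something the original answer proposed; the `parties` and `recoded` names are made up for illustration.

```r
library(dplyr)

# Hypothetical one-column data frame listing the five parties.
parties <- data.frame(
  Party = c("IND", "RP(K)", "DEM", "REP", "BSP"),
  stringsAsFactors = FALSE
)

# Conditions are tried top to bottom; the final TRUE branch is the
# fallback that leaves every other party unchanged.
recoded <- parties %>%
  mutate(Party2 = case_when(
    Party %in% c("DEM", "RP(K)") ~ "DRP",
    Party %in% c("REP", "BSP") ~ "RSP",
    TRUE ~ as.character(Party)
  ))

print(recoded)
```

Here IND stays IND, RP(K) and DEM become DRP, and REP and BSP become RSP, matching the grouping the question asks for.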