Summarizing data by two grouping variables

Time: 2020-02-21 16:19:17

Tags: python r

I have a fairly large CSV file that looks like this.

name             score          year
team_a           4              2005
team_b           6              2005
team_a           2              2005
team_c           7              2005
team_d           3              2005
team_d           4              2005

team_a           2              2006
team_b           4              2006
team_b           3              2006
team_c           4              2006
team_c           2              2006
team_d           1              2006
team_e           5              2006

I would like to downsample it as follows (add up each team's scores for each year and record the total):

name           total_score    year
team_a         6              2005
team_b         6              2005
team_c         7              2005
team_d         7              2005

team_a         2              2006
team_b         7              2006
team_c         6              2006
team_d         1              2006
team_e         5              2006

Any ideas on how to do this?

3 answers:

Answer 0: (score: 2)

My answer is a Python solution. First take a look at the pandas documentation (https://pandas.pydata.org/docs/), then try the following:

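import pandas as pd

# path_to_file is the location of your CSV file
df = pd.read_csv(path_to_file)
# Sum score within each (name, year) pair, then turn the group keys back into columns
df = df.groupby(["name","year"]).sum().reset_index()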
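To check the one-liner above against the sample from the question, here is a minimal sketch that builds the data inline instead of reading a file. The inline CSV text, the variable names, and the use of named aggregation to get a total_score column are additions for illustration, not part of the original answer.

```python
import io

import pandas as pd

# Sample rows copied from the question; an in-memory stand-in for the real CSV file.
csv_text = """name,score,year
team_a,4,2005
team_b,6,2005
team_a,2,2005
team_c,7,2005
team_d,3,2005
team_d,4,2005
team_a,2,2006
team_b,4,2006
team_b,3,2006
team_c,4,2006
team_c,2,2006
team_d,1,2006
team_e,5,2006
"""

df = pd.read_csv(io.StringIO(csv_text))

# Group by both keys and sum score into a column named total_score.
totals = df.groupby(["name", "year"], as_index=False).agg(
    total_score=("score", "sum")
)

# Order rows by year then name, and put the columns in the order the question asks for.
totals = totals.sort_values(["year", "name"]).reset_index(drop=True)
totals = totals[["name", "total_score", "year"]]

print(totals)
```

Named aggregation avoids a separate rename step; if your pandas version predates it, summing and then renaming the score column should give the same table.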

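However, if the file is very large, a chunked approach may be useful; see: how to read only a chunk of csv file fast?

Answer 1: (score: 2)

In R tidyverse terms, this is a summary of score over the groups given by name and year.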
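The chunked approach mentioned above could look roughly like the sketch below. It assumes the file has the name, score, and year columns from the question; the function name sum_scores_in_chunks, the chunk size, and the inline sample data are hypothetical choices for illustration. Because a (name, year) pair can be split across chunks, the partial sums are grouped and summed a second time at the end.

```python
import io

import pandas as pd


def sum_scores_in_chunks(source, chunksize):
    """Sum score per (name, year) while reading the CSV a few rows at a time."""
    partials = []
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partials.append(
            chunk.groupby(["name", "year"], as_index=False)["score"].sum()
        )
    # The same (name, year) pair may appear in several chunks, so combine the
    # partial sums and aggregate once more.
    combined = pd.concat(partials, ignore_index=True)
    return combined.groupby(["name", "year"], as_index=False)["score"].sum()


# Small in-memory example; in practice source would be the path to the large file.
sample = """name,score,year
team_a,4,2005
team_b,6,2005
team_a,2,2005
team_b,4,2006
team_b,3,2006
team_e,5,2006
"""

chunked = sum_scores_in_chunks(io.StringIO(sample), chunksize=2)
print(chunked)
```

Only the per-chunk summaries are held in memory, so this tends to help when the number of distinct (name, year) pairs is much smaller than the number of rows.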

First, let's generate the sample data.

team_score_year_df <- tibble::tribble(
   ~name,~score,~year,
  "team_a", 4, 2005,
  "team_b", 6, 2005,
  "team_a", 2, 2005,
  "team_c", 7, 2005,
  "team_d", 3, 2005,
  "team_d", 4, 2005,
  "team_a", 2, 2006,
  "team_b", 4, 2006,
  "team_b", 3, 2006,
  "team_c", 4, 2006,
  "team_c", 2, 2006,
  "team_d", 1, 2006,
  "team_e", 5, 2006
  )

Now we use dplyr::group_by() and dplyr::summarise() to get your expected result.

library(dplyr)

team_score_year_df %>% 
  group_by(name, year) %>% 
  summarise(total_score = sum(score)) %>% 
  select(name, total_score, year) # In case order of columns is important.
#> # A tibble: 9 x 3
#> # Groups:   name [5]
#>   name   total_score  year
#>   <chr>        <dbl> <dbl>
#> 1 team_a           6  2005
#> 2 team_a           2  2006
#> 3 team_b           6  2005
#> 4 team_b           7  2006
#> 5 team_c           7  2005
#> 6 team_c           6  2006
#> 7 team_d           7  2005
#> 8 team_d           1  2006
#> 9 team_e           5  2006

Edit: Base R solution

As G5W pointed out in a comment, stats::aggregate() can do this as well.

result_df <- aggregate(
  team_score_year_df$score,
  list(name = team_score_year_df$name,
       year = team_score_year_df$year),
  sum
)

names(result_df)[3] <- "total_score"

result_df[c("name", "total_score", "year")]
#>     name total_score year
#> 1 team_a           6 2005
#> 2 team_b           6 2005
#> 3 team_c           7 2005
#> 4 team_d           7 2005
#> 5 team_a           2 2006
#> 6 team_b           7 2006
#> 7 team_c           6 2006
#> 8 team_d           1 2006
#> 9 team_e           5 2006

Answer 2: (score: 0)

Assuming your data is in CSV format, here is a base R solution:

scores = read.csv("scores.csv")
result = aggregate(data=scores, score ~ name+year, FUN=sum)
colnames(result)[3] = "total_score"