I have a fairly large CSV of records that looks like this:
name score year
team_a 4 2005
team_b 6 2005
team_a 2 2005
team_c 7 2005
team_d 3 2005
team_d 4 2005
team_a 2 2006
team_b 4 2006
team_b 3 2006
team_c 4 2006
team_c 2 2006
team_d 1 2006
team_e 5 2006
I want to downsample it as follows (sum each team's scores per year and record the totals):
name total_score year
team_a 6 2005
team_b 6 2005
team_c 7 2005
team_d 7 2005
team_a 2 2006
team_b 7 2006
team_c 6 2006
team_d 1 2006
team_e 5 2006
Any ideas on how to do this?
Answer 0 (score: 2)
My answer is a Python solution. First have a look at the pandas documentation (https://pandas.pydata.org/docs/), then try the following:
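import pandas as pd
df = pd.read_csv(path_to_file)
# Sum score within each (name, year) group and call the result total_score
df = (df.groupby(["name", "year"], as_index=False)["score"].sum()
        .rename(columns={"score": "total_score"}))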
However, if the file is very large, a chunked approach may be useful; see: how to read only a chunk of csv file fast?
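The chunked idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the file is comma-separated with the header `name,score,year`, and the helper name `total_scores_chunked` and the chunk size are made up for this example. Because one (name, year) group can be split across several chunks, the partial sums are summed a second time at the end.

```python
import io

import pandas as pd


def total_scores_chunked(source, chunksize=100_000):
    """Sum score per (name, year) without loading the whole CSV at once.

    `source` is a path or file-like object; the function name and the
    default chunk size are hypothetical choices for this sketch.
    """
    partials = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Partial sums for the rows contained in this chunk only.
        partials.append(chunk.groupby(["name", "year"])["score"].sum())
    # A group may appear in several chunks, so add the partial sums together.
    combined = pd.concat(partials).groupby(level=["name", "year"]).sum()
    result = combined.rename("total_score").reset_index()
    return result[["name", "total_score", "year"]]


# Small in-memory stand-in for the large file, using the data from the question.
sample = io.StringIO(
    "name,score,year\n"
    "team_a,4,2005\nteam_b,6,2005\nteam_a,2,2005\nteam_c,7,2005\n"
    "team_d,3,2005\nteam_d,4,2005\nteam_a,2,2006\nteam_b,4,2006\n"
    "team_b,3,2006\nteam_c,4,2006\nteam_c,2,2006\nteam_d,1,2006\n"
    "team_e,5,2006\n"
)
result = total_scores_chunked(sample, chunksize=4)
print(result)
```

The tiny `chunksize=4` is used here only to force groups to span chunks; for a real file a much larger value would normally make sense.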
Answer 1 (score: 2)
In R and tidyverse terms, this is a summary of score over the groups defined by name and year.
First, let's generate the sample data.
team_score_year_df <- tibble::tribble(
~name,~score,~year,
"team_a", 4, 2005,
"team_b", 6, 2005,
"team_a", 2, 2005,
"team_c", 7, 2005,
"team_d", 3, 2005,
"team_d", 4, 2005,
"team_a", 2, 2006,
"team_b", 4, 2006,
"team_b", 3, 2006,
"team_c", 4, 2006,
"team_c", 2, 2006,
"team_d", 1, 2006,
"team_e", 5, 2006
)
Now we use dplyr::group_by() and dplyr::summarise() to get your expected result.
library(dplyr)
team_score_year_df %>%
group_by(name, year) %>%
summarise(total_score = sum(score)) %>%
select(name, total_score, year) # In case order of columns is important.
#> # A tibble: 9 x 3
#> # Groups: name [5]
#> name total_score year
#> <chr> <dbl> <dbl>
#> 1 team_a 6 2005
#> 2 team_a 2 2006
#> 3 team_b 6 2005
#> 4 team_b 7 2006
#> 5 team_c 7 2005
#> 6 team_c 6 2006
#> 7 team_d 7 2005
#> 8 team_d 1 2006
#> 9 team_e 5 2006
Edit: Base R solution
As G5W pointed out in a comment, stats::aggregate() can also do this.
result_df <- aggregate(
team_score_year_df$score,
list(name = team_score_year_df$name,
year = team_score_year_df$year),
sum
)
names(result_df)[3] <- "total_score"
result_df[c("name", "total_score", "year")]
#> name total_score year
#> 1 team_a 6 2005
#> 2 team_b 6 2005
#> 3 team_c 7 2005
#> 4 team_d 7 2005
#> 5 team_a 2 2006
#> 6 team_b 7 2006
#> 7 team_c 6 2006
#> 8 team_d 1 2006
#> 9 team_e 5 2006
Answer 2 (score: 0)
Assuming your data is in CSV format, here is a base R solution:
scores = read.csv("scores.csv")
result = aggregate(data=scores, score ~ name+year, FUN=sum)
colnames(result)[3] = "total_score"