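I have a data frame df with 2 columns and 3659 rows.
I am trying to reduce the dataset by averaging every 10 or 13 rows of this data frame, so I tried the following: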
# number of rows per group
n=13
# number of groups
n_grp=nrow(df)/n
round(n_grp,0)
# row indices (one vector per group)
idx_grp <- split(seq(df), rep(seq(n_grp), each = n))
# calculate the col means for all groups
res <- lapply(idx_grp, function(i) {
# subset of the data frame
tmp <- dat[i]
# calculate row means
colMeans(tmp, na.rm = TRUE)
})
# transform list into a data frame
dat2 <- as.data.frame(res)
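However, I cannot divide the number of rows by 10 or 13, because the length of the data is not a multiple of the split variable. So I am not sure what I should do (I just want the mean of the last group to be computed too, even if it has fewer than 10 elements).
I also tried this, but the result is the same: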
df1=split(df, sample(rep(1:301, 10)))
Answer 0 (score: 10)
Here is a solution using aggregate() and rep().
df <- data.frame(a=1:12, b=13:24 );
df;
## a b
## 1 1 13
## 2 2 14
## 3 3 15
## 4 4 16
## 5 5 17
## 6 6 18
## 7 7 19
## 8 8 20
## 9 9 21
## 10 10 22
## 11 11 23
## 12 12 24
n <- 5;
aggregate(df,list(rep(1:(nrow(df)%/%n+1),each=n,len=nrow(df))),mean)[-1];
## a b
## 1 3.0 15.0
## 2 8.0 20.0
## 3 11.5 23.5
The important part of this solution, the one that handles nrow(df) not being divisible by n, is specifying the len argument of rep() (the full argument name is actually length.out), which automatically caps the group vector at the appropriate length.
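To make the effect of length.out concrete, here is a minimal sketch that builds only the grouping vector for the 12-row example above. The names nr and grp are introduced here for illustration and do not appear in the original answer.

```r
# Build the grouping vector on its own, for 12 rows in groups of 5.
n  <- 5
nr <- 12   # stands in for nrow(df)

# nr %/% n + 1 gives 3 candidate groups; each = n repeats every group id
# 5 times (15 values), and length.out = nr cuts the result back to 12.
grp <- rep(seq_len(nr %/% n + 1), each = n, length.out = nr)

grp
## [1] 1 1 1 1 1 2 2 2 2 2 3 3
```

The last group simply ends up shorter (2 rows here), which is why the third row of the aggregate() output is the mean of only rows 11 and 12. When nr is an exact multiple of n, the extra candidate group is dropped entirely by the truncation.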
Answer 1 (score: 7)
If df is a data.table, you can group with %/% (z below stands for the column to average), as in:
library(data.table)
setDT(df)
n <- 13 # every 13 rows
df[, mean(z), by= (seq(nrow(df)) - 1) %/% n]
If you instead want every nth row grouped together (rows 1, n+1, 2n+1, ... in one group), use %% in place of %/%:
df[, mean(z), by= (seq(nrow(df)) - 1) %% n]
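The difference between the two operators is easiest to see on the row indices themselves. The following is a small base R sketch with a hypothetical 7-row input and n = 3; the names i, block_id and cycle_id are illustrative only.

```r
n <- 3
i <- seq_len(7)   # row numbers 1..7

# Integer division: consecutive rows share a group id (blocks of n rows).
block_id <- (i - 1) %/% n
block_id
## [1] 0 0 0 1 1 1 2

# Modulo: rows that are n apart share a group id (every nth row).
cycle_id <- (i - 1) %% n
cycle_id
## [1] 0 1 2 0 1 2 0
```

With %/% the last block is just shorter when the row count is not a multiple of n, so no special handling is needed for the question's 3659 rows.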
Answer 2 (score: 5)
This should work. Use n = 13 to aggregate 13 rows at a time. If you have 27 rows, you get groups of size 13, 13 and 1.
n.colmeans = function(df, n = 10){
aggregate(x = df,
by = list(gl(ceiling(nrow(df)/n), n)[1:nrow(df)]),
FUN = mean)
}
n.colmeans(state.x77, 10)
Group.1 Population Income Illiteracy Life Exp Murder HS Grad Frost Area
1 1 4892.8 4690.8 1.44 70.508 9.53 53.63 75.1 116163.6
2 2 3570.5 4419.4 1.12 71.110 7.07 53.35 99.8 44406.6
3 3 3697.9 4505.5 0.93 70.855 6.64 55.25 131.7 60873.0
4 4 5663.9 4331.2 1.33 70.752 7.12 49.59 103.6 56949.5
5 5 3407.0 4232.1 1.03 71.168 6.53 53.72 112.1 75286.7
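To check the group sizes claimed for the 27-row case, one can inspect the grouping factor by itself. This is only a sketch; nr and g are names made up for the illustration.

```r
n  <- 13
nr <- 27   # hypothetical row count

# gl() creates ceiling(27 / 13) = 3 levels, each repeated 13 times (39 values);
# indexing with 1:nr keeps only the first 27 of them.
g <- gl(ceiling(nr / n), n)[1:nr]

table(g)
## g
##  1  2  3
## 13 13  1
```

Because the factor is truncated rather than padded, the final group contains whatever rows are left over, and aggregate() averages just those.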
Answer 3 (score: 1)
The dplyr way:
library(dplyr)
n1 <- 10
iris %>% group_by(mean = (row_number() -1) %/% n1) %>%
mutate(mean = mean(Sepal.Length))
# A tibble: 150 x 6
# Groups: mean [15]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species mean
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 4.86
2 4.9 3 1.4 0.2 setosa 4.86
3 4.7 3.2 1.3 0.2 setosa 4.86
4 4.6 3.1 1.5 0.2 setosa 4.86
5 5 3.6 1.4 0.2 setosa 4.86
6 5.4 3.9 1.7 0.4 setosa 4.86
7 4.6 3.4 1.4 0.3 setosa 4.86
8 5 3.4 1.5 0.2 setosa 4.86
9 4.4 2.9 1.4 0.2 setosa 4.86
10 4.9 3.1 1.5 0.1 setosa 4.86
# ... with 140 more rows
Or, if n1 is not a divisor of nrow(df), this works as well:
n1 <- 7
# A tibble: 150 x 6
# Groups: mean [21]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species mean
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 5.1 3.5 1.4 0.2 setosa 4.9
2 4.9 3 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.9
4 4.6 3.1 1.5 0.2 setosa 4.9
5 5 3.6 1.4 0.2 setosa 4.9
6 5.4 3.9 1.7 0.4 setosa 4.9
7 4.6 3.4 1.4 0.3 setosa 4.9
8 5 3.4 1.5 0.2 setosa 4.8
9 4.4 2.9 1.4 0.2 setosa 4.8
10 4.9 3.1 1.5 0.1 setosa 4.8
# ... with 140 more rows
You can also mutate across multiple columns:
mydf <- iris[-5]
mydf %>% group_by(n = (row_number() -1) %/% n1) %>%
mutate(across(everything(), ~ mean(.), .names = "{.col}_mean"))
# A tibble: 150 x 9
# Groups: n [22]
Sepal.Length Sepal.Width Petal.Length Petal.Width n Sepal.Length_me~ Sepal.Width_mean Petal.Length_me~
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2 0 4.9 3.39 1.44
2 4.9 3 1.4 0.2 0 4.9 3.39 1.44
3 4.7 3.2 1.3 0.2 0 4.9 3.39 1.44
4 4.6 3.1 1.5 0.2 0 4.9 3.39 1.44
5 5 3.6 1.4 0.2 0 4.9 3.39 1.44
6 5.4 3.9 1.7 0.4 0 4.9 3.39 1.44
7 4.6 3.4 1.4 0.3 0 4.9 3.39 1.44
8 5 3.4 1.5 0.2 1 4.8 3.21 1.43
9 4.4 2.9 1.4 0.2 1 4.8 3.21 1.43
10 4.9 3.1 1.5 0.1 1 4.8 3.21 1.43
# ... with 140 more rows, and 1 more variable: Petal.Width_mean <dbl>
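Note that mutate() keeps all 150 rows and only attaches the group means. If the aim is to shrink the dataset to one row per group, as in the question, a possible variation is to use summarise() instead. This is a sketch rather than part of the original answer; the names grp and reduced are assumptions, and across() requires dplyr 1.0.0 or later.

```r
library(dplyr)

n1 <- 7

# One output row per block of n1 consecutive rows; the last block holds
# the 3 leftover rows (150 = 21 * 7 + 3).
reduced <- iris[-5] %>%
  group_by(grp = (row_number() - 1) %/% n1) %>%
  summarise(across(everything(), mean))

nrow(reduced)
## [1] 22
```

The first row of reduced holds the means of rows 1 to 7 of iris, which matches the 4.9 shown for Sepal.Length in the mutate() output above.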