I'm looking to calculate the mean and standard deviation (std. dev.) of each row in a dataset. However, I want to ignore the leading zeros.
Row 1: 0 0 0 0 9 0 8 5
Row 2: 0 0 3 5 6 0 0 0
I want to compute the statistics over [9 0 8 5]
and [3 5 6 0 0 0]
Is there an easy way to do this with an R data frame?
Answer 0 (score: 4)
Perhaps not the most elegant, but in this case you can use cumsum.
Try:
> apply(mydf, 1, function(x) mean(x[cumsum(x) > 0]))
[1] 5.500000 2.333333
You can extend this idea by moving the function outside of apply, so that you can customise which statistics it returns, like this:
myFun <- function(x) {
  x <- x[cumsum(x) > 0]
  c(mean = mean(x), sd = sd(x))
}
apply(mydf, 1, myFun)
# [,1] [,2]
# mean 5.500000 2.333333
# sd 4.041452 2.732520
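One caveat: cumsum(x) > 0 relies on the values being non-negative. If negative numbers can occur, the running sum may drop to zero or below after the first non-zero entry. A minimal sketch of a variant that tests for non-zero entries instead; the helper names trim_leading_zeros and row_stats and the example data frame are assumptions for illustration, not part of the answer above:

```r
# Hypothetical example data matching the two rows in the question
mydf <- data.frame(rbind(c(0, 0, 0, 0, 9, 0, 8, 5),
                         c(0, 0, 3, 5, 6, 0, 0, 0)))

# Drop everything before the first non-zero entry.
# cumsum(x != 0) counts the non-zero entries seen so far, so it stays at 0
# only across the leading zeros, whatever the sign of the values.
trim_leading_zeros <- function(x) x[cumsum(x != 0) > 0]

row_stats <- function(x) {
  x <- trim_leading_zeros(x)
  c(mean = mean(x), sd = sd(x))
}

res <- apply(mydf, 1, row_stats)
res
```

For a row that contains only zeros the trimmed vector is empty, so the statistics for that row are not meaningful and may need separate handling.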
Answer 1 (score: 4)
How about using the vectorised rowMeans function?
rowMeans(replace(dat, col(dat) < max.col(dat != 0, ties.method="first"), NA), na.rm=TRUE)
#[1] 5.500000 2.333333
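The one-liner packs several steps together. A sketch that spells them out, assuming dat holds the two rows from the question (the intermediate names first_nz, leading and masked are made up for illustration):

```r
# Hypothetical example data matching the question
dat <- data.frame(rbind(c(0, 0, 0, 0, 9, 0, 8, 5),
                        c(0, 0, 3, 5, 6, 0, 0, 0)))

# Column index of the first non-zero value in each row.
# In the logical matrix dat != 0, TRUE is the row maximum, and
# ties.method = "first" picks the left-most TRUE.
first_nz <- max.col(dat != 0, ties.method = "first")
first_nz
# [1] 5 3

# TRUE for every cell to the left of its row's first non-zero value.
# first_nz is recycled down the columns, so row i is compared with first_nz[i].
leading <- col(dat) < first_nz

# Blank out the leading zeros, then average what is left in each row
masked <- replace(dat, leading, NA)
rowMeans(masked, na.rm = TRUE)
```

Note that for a row containing only zeros every entry ties, so max.col should return column 1 and nothing is masked; such a row would then average to 0 rather than being flagged.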
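If speed is a concern for large datasets, this will be much faster than using apply. If not, apply is certainly more readable.
Unfortunately, this approach costs some flexibility, since a rowX function does not exist for every statistic.
There is rowSds in the matrixStats package, though, which is also fast: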
library(matrixStats)
rowSds(as.matrix(replace(dat, col(dat) < max.col(dat != 0, ties.method="first"), NA)), na.rm=TRUE)
#[1] 4.041452 2.732520
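Where no ready-made rowX function is available, some statistics can still be assembled from rowSums and rowMeans. A base-R sketch of a row-wise standard deviation built this way, as an alternative to matrixStats rather than a description of how rowSds works internally (dat and the intermediate names are assumptions for illustration):

```r
# Hypothetical example data matching the question
dat <- data.frame(rbind(c(0, 0, 0, 0, 9, 0, 8, 5),
                        c(0, 0, 3, 5, 6, 0, 0, 0)))

# Same masking as above: leading zeros become NA
masked <- as.matrix(replace(dat, col(dat) < max.col(dat != 0, ties.method = "first"), NA))

n  <- rowSums(!is.na(masked))          # number of values kept per row
mu <- rowMeans(masked, na.rm = TRUE)   # row means of the kept values

# Sample standard deviation per row: sqrt(sum((x - mean)^2) / (n - 1)).
# mu is recycled down the columns, so each row is centred on its own mean.
row_sd <- sqrt(rowSums((masked - mu)^2, na.rm = TRUE) / (n - 1))
row_sd
```

This stays vectorised, but it only covers statistics that can be written in terms of row sums; for anything else, apply remains the more general tool.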
Answer 2 (score: 3)
Try
apply(df1, 1, function(x)
mean(x[Position(function(y) y >0, x):length(x)]))
#[1] 5.500000 2.333333
apply(df1, 1, function(x) sd(x[Position(function(y)
y >0, x):length(x)]))
#[1] 4.041452 2.732520
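We can wrap it in a function
f1 <- function(dat, ...){
  # capture the unevaluated function names passed after dat
  args <- as.list(match.call())[-(1:2)]
  res <- sapply(args, function(FUN) apply(dat, 1, function(x){
    # keep everything from the first positive, non-NA value onwards
    x <- x[Position(function(y) y > 0 & !is.na(y), x):length(x)]
    eval(FUN)(x, na.rm=TRUE)
  }
  ))
  colnames(res) <- args
  res
}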
f1(df1, mean)
# mean
#[1,] 5.500000
#[2,] 2.333333
f1(df1, mean, sd, median)
# mean sd median
#[1,] 5.500000 4.041452 6.5
#[2,] 2.333333 2.732520 1.5
f1(df2, mean, sd)
# mean sd
#[1,] 7.333333 2.081666
#[2,] 1.500000 3.000000
f1(df3, mean, sd)
# mean sd
#[1,] 7.333333 2.081666
#[2,] 1.500000 3.000000
Data:
df1 <- structure(list(v1 = c(0L, 0L), v2 = c(0L, 0L), v3 = c(0L, 3L),
v4 = c(0L, 5L), v5 = c(9L, 6L), v6 = c(0L, 0L), v7 = c(8L,
0L), v8 = c(5L, 0L)), .Names = c("v1", "v2", "v3", "v4",
"v5", "v6", "v7", "v8"), class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(v1 = c(0L, 0L), v2 = c(0L, 0L), v3 = c(NA, 0),
v4 = c(0, 0), v5 = c(9L, 6L), v6 = c(NA, 0L), v7 = c(8L,
0L), v8 = c(5L, 0L)), .Names = c("v1", "v2", "v3", "v4",
"v5", "v6", "v7", "v8"), row.names = c(NA, -2L), class = "data.frame")
df3 <- structure(list(v1 = c(0L, 0L), v2 = c(0L, 0L), v3 = c(0, 0),
v4 = c(0, 0), v5 = c(9L, 6L), v6 = c(NA, 0L), v7 = c(8L,
0L), v8 = c(5L, 0L)), .Names = c("v1", "v2", "v3", "v4",
"v5", "v6", "v7", "v8"), row.names = c(NA, -2L), class = "data.frame")
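As a possible simplification, the statistics could be passed as a named list of functions, which avoids match.call and eval. This is a sketch under assumptions: f2 is a hypothetical name, the example data frame is rebuilt here so the block runs on its own, and a row with no positive value returns NA instead of raising an error:

```r
# Hypothetical example data matching the two rows in the question
df1 <- data.frame(rbind(c(0, 0, 0, 0, 9, 0, 8, 5),
                        c(0, 0, 3, 5, 6, 0, 0, 0)))

# Hypothetical variant of f1: funs is a named list of summary functions
# that accept an na.rm argument
f2 <- function(dat, funs) {
  sapply(funs, function(FUN) apply(dat, 1, function(x) {
    # index of the first positive, non-NA value (NA if there is none)
    start <- Position(function(y) !is.na(y) && y > 0, x)
    if (is.na(start)) return(NA_real_)
    FUN(x[start:length(x)], na.rm = TRUE)
  }))
}

res <- f2(df1, list(mean = mean, sd = sd, median = median))
res
```

The column names of the result come from the names of the list, so they can be chosen freely by the caller.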
Answer 3 (score: 0)
Try:
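# named vals rather than c, to avoid shadowing the base function c()
vals <- c(0, 0, 0, 0, 9, 0, 8, 5
        , 0, 0, 3, 5, 6, 0, 0, 0)
df <- as.data.frame(matrix(vals, 2, 8, byrow = TRUE))
for ( i in 1:2 ) {
  x <- sapply(df[i, 1:8], as.numeric)
  # match(x, 0) is NA wherever x is non-zero, so y is the first non-zero position
  y <- match(NA, match(x, 0))
  z <- x[y:8]
  df[i,"Avg"] <- mean(z)
  df[i,"Sd"] <- sd(z)
}
rm(vals,x,y,z)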
df
# V1 V2 V3 V4 V5 V6 V7 V8 Avg Sd
# 1 0 0 0 0 9 0 8 5 5.500000 4.041452
# 2 0 0 3 5 6 0 0 0 2.333333 2.732520