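This is probably a basic question, and there may be duplicates, but I can't seem to find them, so please bear with me and point me to the right place. Thanks!
I have a data frame containing integers, possibly with NAs and missing values. I am calculating row means (with NAs set to zero) and column means (skipping NAs). I would like to create a data frame (or table) that contains the integers together with the row means and column means. Here is an example data frame: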
df <- data.frame(
'ID' = c("123A","456B","789C","1011","1213")
, 'Test 1' = c(55,65,60,NA,50)
, 'Test 2' = c(45,48,50,52,55)
, 'Test 3' = c(51,49,55,69,61)
)
df
ID Test.1 Test.2 Test.3
1 123A 55 45 51
2 456B 65 48 49
3 789C 60 50 55
4 1011 NA 52 69
5 1213 50 55 61
Here is the function that calculates the column means, skipping NAs:
colMean <- function(df, na.rm = TRUE) {
if (na.rm) {
n <- rowSums(!is.na(df))
} else {
n <- ncol(df)
}
colMean <- colMeans(df, na.rm=na.rm)
return(rbind(df, "colMean" = colMean))
}
Here is the function that calculates the row means, with NAs set to zero:
rowMeanz <- function(df) {
df[is.na(df)] <- 0
return(cbind(df, "rowMean" = rowMeans(df)))
}
One issue is that rbind changes the data type: the integers in the column labeled "Test.1" are converted to floating point (or so it appears):
colMean(df[sapply(df, is.numeric)])
Test.1 Test.2 Test.3
1 55.0 45 51
2 65.0 48 49
3 60.0 50 55
4 NA 52 69
5 50.0 55 61
colMean 57.5 50 57
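In your answer, I would greatly appreciate an explanation of why only the first column appears to be affected in this case. Is it related to the presence of the NA in that column?
I do not observe the same issue with the other function, which is based on cbind: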
rowMeanz(df[sapply(df, is.numeric)])
Test.1 Test.2 Test.3 rowMean
1 55 45 51 50.33333
2 65 48 49 54.00000
3 60 50 55 55.00000
4 0 52 69 40.33333
5 50 55 61 55.33333
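On the question of why only Test.1 appears to change: a quick check suggests this is a printing effect rather than a type conversion. Values created with c(55, 65, ...) are already stored as doubles, and print() formats each column of a data frame separately, using enough decimals for every entry in that column. Only the Test.1 mean (57.5) has a fractional part, so only that column is shown with a decimal. The sketch below assumes the example data from the question; the names types_before, types_after, with_means and shared are illustrative only.

```r
# Example data from the question (numeric columns only).
df <- data.frame(
  Test.1 = c(55, 65, 60, NA, 50),
  Test.2 = c(45, 48, 50, 52, 55),
  Test.3 = c(51, 49, 55, 69, 61)
)

# Storage type of each column before adding the means row.
types_before <- sapply(df, typeof)

# Append the column means (NAs skipped) as a new row named "colMean".
with_means <- rbind(df, colMean = colMeans(df, na.rm = TRUE))

# Storage type of each column afterwards.
types_after <- sapply(with_means, typeof)

# format() pads all entries of one vector to a common number of decimals,
# which is what makes 55 show up as 55.0 next to 57.5.
shared <- format(c(55, 57.5))

print(types_before)
print(types_after)
print(shared)
```

With cbind the row means land in their own column, so the Test columns still hold only whole numbers and print without decimals.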
Ultimately I would like to obtain a data frame or table that looks like this:
ID Test.1 Test.2 Test.3 rowMean
1 123A 55 45 51 50.33333
2 456B 65 48 49 54.00000
3 789C 60 50 55 55.00000
4 1011 NA 52 69 40.33333
5 1213 50 55 61 55.33333
6 colMean 57.5 50 57
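I would be grateful if you could show me how to do this in not too many steps. I am open to base R answers as well as package-based answers. These calculations will be done online inside a Shiny app, so I would especially like to see efficient approaches. Many thanks!
Answer 0 (score: 1)
Not sure whether my solution is particularly useful for your problem, but here is my approach: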
df <- data.frame(
'Test 1' = c(55,65,60,NA,50),
'Test 2' = c(45,48,50,52,55),
'Test 3' = c(51,49,55,69,61)
)
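# First, it might be a good idea to set the IDs as the row names.
rownames(df) <- c("123A","456B","789C","1011","1213")
# Calculate the column and row means, both skipping NAs.
# (To count NAs as zero in the row means, set them to 0 on a copy first.)
colMean <- apply(df, 2, function(x) mean(x, na.rm = TRUE))
rowMean <- apply(df, 1, function(x) mean(x, na.rm = TRUE))
df$rowMean <- rowMean
# Append the column means as a new row; its rowMean cell is left as NA.
df <- rbind(df, c(colMean, rowMean = NA))
rownames(df)[nrow(df)] <- "colMean"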
Answer 1 (score: 1)
It is best to convert the data to character format in the form you want, and then put the pieces together.
df <- data.frame(
row.names = c("123A","456B","789C","1011","1213")
, 'Test 1' = c(55,65,60,NA,50)
, 'Test 2' = c(45,48,50,52,55)
, 'Test 3' = c(51,49,55,69,61)
)
colm <- colMeans(df, na.rm=TRUE)
d0 <- df
d0[is.na(d0)] <- 0
rowm <- rowMeans(d0)
dd <- format(df)
dc <- formatC(colm, digits=1, format="f")
dr <- formatC(rowm, digits=4, format="f")
out <- cbind(rbind(dd, colMeans=dc), rowMeans=c(dr, ""))
print(out, right=FALSE)
## Test.1 Test.2 Test.3 rowMeans
## 123A 55 45 51 50.3333
## 456B 65 48 49 54.0000
## 789C 60 50 55 55.0000
## 1011 NA 52 69 40.3333
## 1213 50 55 61 55.3333
## colMeans 57.5 50.0 57.0
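If the values need to stay numeric, for example for further calculations inside the Shiny app, one alternative is to leave the empty corner cell as NA instead of converting everything to character. The following is only a sketch under that assumption, using the example data from the question with the ID kept as a column; the names num, zeroed, col_means, row_means and out are illustrative and not part of the answer above.

```r
# Example data from the question, with the ID kept as a regular column.
df <- data.frame(
  ID     = c("123A", "456B", "789C", "1011", "1213"),
  Test.1 = c(55, 65, 60, NA, 50),
  Test.2 = c(45, 48, 50, 52, 55),
  Test.3 = c(51, 49, 55, 69, 61),
  stringsAsFactors = FALSE
)

# Work on the numeric columns only.
num <- df[sapply(df, is.numeric)]

# Column means skip NAs.
col_means <- colMeans(num, na.rm = TRUE)

# Row means count NAs as zero, computed on a copy so df keeps its NA.
zeroed <- num
zeroed[is.na(zeroed)] <- 0
row_means <- rowMeans(zeroed)

# Add the row means as a column, then the column means as a final row.
# The rowMean cell of the final row has no meaning here and stays NA.
out <- cbind(df, rowMean = row_means)
out <- rbind(
  out,
  data.frame(ID = "colMean", as.list(col_means), rowMean = NA_real_,
             stringsAsFactors = FALSE)
)

print(out)
```

Because the result is an ordinary data frame with numeric columns, rounding and display formatting can be deferred to whatever renders the table.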
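Answer 2 (score: 0)
I would like to follow up with how I used Aaron's suggestion to build a table that summarizes the data. It should be easy to extend to other statistics such as min, max, skew, etc.
Data: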
df <- data.frame(
'ID' = c("123A","456B","789C","1011","1213")
, 'Test 1' = c(13,8,14,NA,15)
, 'Test 2' = c(13,4,16,7,12)
, 'Test 3' = c(15,9,13,6,13)
)
Several functions that compute the statistics used to summarize the data:
colMean <- function(df, na.rm = TRUE) { # either remove NAs or set them to zero
if (!na.rm) { # set NAs to zero
df[is.na(df)] <- 0
}
colMean <- colMeans(df, na.rm=na.rm)
return(colMean)
}
rowMean <- function(df, na.rm = TRUE) { # either remove NAs or set them to zero
if (!na.rm) { # set NAs to zero
df[is.na(df)] <- 0
}
rowMean <- rowMeans(df, na.rm=na.rm)
return(rowMean)
}
rowSd <- function(df, na.rm = TRUE) { # either remove NAs or set them to zero
if (na.rm) { # remove NAs
n <- rowSums(!is.na(df))
} else {
df[is.na(df)] <- 0
n <- ncol(df)
}
rowMean <- rowMeans(df, na.rm=na.rm)
rowVar <- rowMeans(df*df, na.rm=na.rm) - (rowMeans(df, na.rm=na.rm))^2
rowSd <- sqrt(rowVar * n/(n-1))
return(rowSd)
}
colSd <- function(df, na.rm = TRUE) { # either remove NAs or set them to zero
if (na.rm) { # remove NAs
n <- colSums(!is.na(df))
} else {
df[is.na(df)] <- 0
n <- nrow(df)
}
colMean <- colMeans(df, na.rm=na.rm)
colVar <- colMeans(df*df, na.rm=na.rm) - (colMeans(df, na.rm=na.rm))^2
colSd <- sqrt(colVar * n/(n-1))
return(colSd)
}
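The rowSd and colSd functions above rely on the identity that the population variance equals the mean of the squares minus the square of the mean, rescaled by n/(n-1) to obtain the sample variance. As a small sanity check, the sketch below compares that shortcut with R's built-in sd() on the first row of the example data; the names x, n, shortcut and reference are illustrative only.

```r
# First row of the example data: Test 1 = 13, Test 2 = 13, Test 3 = 15.
x <- c(13, 13, 15)
n <- length(x)

# Shortcut used in rowSd/colSd: (E[x^2] - E[x]^2) * n / (n - 1), then sqrt.
shortcut <- sqrt((mean(x^2) - mean(x)^2) * n / (n - 1))

# Reference value from the built-in sample standard deviation.
reference <- sd(x)

print(all.equal(shortcut, reference))
```

The shortcut vectorizes well over rows and columns, but it can lose precision when the values are large relative to their spread, so an occasional comparison against sd() on real data may be worthwhile.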
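The summary is a function of the data frame 'df', the statistics along columns 'col', the statistics along rows 'row', and the padding character 'pad'. The 'pad' character can be set to "" for empty cells, to NA, or to something else. By default, NAs are removed along columns but set to zero along rows.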
summ <- function(df
, col = list("colMean" = colMean)
, row = list("rowMean" = rowMean)
, pad = NA_character_)
{
dfN <- df[sapply(df, is.numeric)]
colN <-lapply(col, function(x){formatC(x(dfN, na.rm = TRUE), 'digits' = 1, 'format' = "f")})
rowN <-lapply(row, function(x){formatC(x(dfN, na.rm = FALSE), 'digits' = 1, 'format' = "f")})
pad <- rep(pad,'length' = length(colN))
out <- cbind(rbind(format(dfN),do.call(rbind,colN)), lapply(rowN,function(x){c(x,pad)}))
return(print(out, 'right' = FALSE))
}
Usage examples:
c <- list("colMean" = colMean, "colSd" = colSd)
r <- list("rowMean" = rowMean, "rowSd" = rowSd)
summ(df)
summ(df,c,r)
summ(df,'col'=c,'row'=r)
summ(df,'col'=c,'row'=r, 'pad'="X")
Test.1 Test.2 Test.3 rowMean rowSd
1 13 13 15 13.7 1.2
2 8 4 9 7.0 2.6
3 14 16 13 14.3 1.5
4 NA 7 6 4.3 3.8
5 15 12 13 13.3 1.5
colMean 12.5 10.4 11.2 X X
colSd 3.1 4.8 3.6 X X
Of course, feel free to comment. Thanks!