I am trying to subset a large data matrix, a sample of which is shown below:
row 1/col 1 row 1/col 2 row 1/col 3
[1,] 855.815 749.574 754.950
[2,] 855.718 749.496 755.004
[3,] 855.846 749.359 754.910
[4,] 855.746 749.299 754.795
[5,] 855.805 749.421 754.883
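I am trying to remove the columns whose first-row value is more than one standard deviation above or below the mean of the entire first row, using the following code: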
library(matrixStats)
x = data[,-1] > (rowMeans(data[,-1]) + rowSds(data[,-1]))
y = data[,-1] < (rowMeans(data[,-1]) - rowSds(data[,-1]))
subset(df2, !(x | y))
But when applied to my dataset, it returns the following error:
Error in x[subset & !is.na(subset), vars, drop = drop] :
(subscript) logical subscript too long
As far as I can tell, R has expanded this to:
subset(df2, !(data[,-1] > (rowMeans(data[,-1]) + rowSds(data[,-1]))|data[,-1] < (rowMeans(data[,-1]) - rowSds(data[,-1]))))
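and the logical argument is too long. Is there something I'm missing? I am inexperienced with R and am sure there is a cleaner way to do this, but from what I have read, subset seemed the most useful.
Thanks in advance.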
Answer 0 (score: 1)
You can try this:
df <- as.matrix(read.table(text='C1 C2 C3
[1,] 855.815 749.574 754.950
[2,] 855.718 749.496 755.004
[3,] 855.846 749.359 754.910
[4,] 855.746 749.299 754.795
[5,] 855.805 749.421 754.883', header=TRUE))
library(matrixStats)
df[,which(abs(df[1,] - rowMeans(df)[1]) < rowSds(df)[1])]
# C2 C3
#[1,] 749.574 754.950
#[2,] 749.496 755.004
#[3,] 749.359 754.910
#[4,] 749.299 754.795
#[5,] 749.421 754.883
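As for why the original attempt fails: the error message shows subset() using its second argument as a row index, while the condition built in the question is a whole logical matrix with one entry per cell, so it has more elements than there are rows. Below is a minimal sketch of the column selection using only base R (mean and sd in place of matrixStats). The names m, bad, first_row, keep and result are illustrative, the column names are borrowed from the code above, and <= is used so that the test is the exact negation of the question's strict > and < comparisons.

```r
# Sample matrix from the question; column names C1..C3 are borrowed from the answer.
m <- matrix(c(855.815, 749.574, 754.950,
              855.718, 749.496, 755.004,
              855.846, 749.359, 754.910,
              855.746, 749.299, 754.795,
              855.805, 749.421, 754.883),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("C1", "C2", "C3")))

# The kind of condition the question builds: a logical matrix, one entry per cell.
bad <- m > (rowMeans(m) + apply(m, 1, sd))
# length(bad) is nrow(m) * ncol(m), which is longer than the nrow(m) entries
# a row index can have; hence "logical subscript too long".

# Column selection needs one logical value per column, taken from the first row only.
first_row <- m[1, ]
keep <- abs(first_row - mean(first_row)) <= sd(first_row)

# drop = FALSE keeps the result a matrix even if only one column survives.
result <- m[, keep, drop = FALSE]
```

With the sample data, result holds columns C2 and C3, matching the output above. Passing drop = FALSE is a defensive choice for larger datasets, where the filter could leave a single column that would otherwise collapse to a plain vector.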