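I'm relatively new to R and am working on a project with millions of rows, so I made this example:
I have a matrix with three columns of different data.
If a combination of [,1] [,2] [Farm] has fewer than two observations in total, the [Farm] value of that row should be changed to q99999. That way those rows all belong to the same group for later analysis.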
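library(plyr)  # provides rbind.fill.matrix
A <- matrix(c(1,1,2,3,4,5,5), ncol = 7)
B <- matrix(c(TRUE,TRUE,FALSE,TRUE,FALSE,TRUE,TRUE), ncol = 7)
C <- matrix(c("Req","Req","Req","fd","as","f","bla"), ncol = 7)
AB <- rbind.fill.matrix(A, B, C)
AB <- t(AB)  # transpose so that each of A, B, C becomes a column
colnames(AB) <- c("Col1", "Col2", "Farm")
format(AB)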
Col1 Col2 Farm
1 "1 " "1 " "Req"
2 "1 " "1 " "Req"
3 "2 " "0 " "Req"
4 "3 " "1 " "fd "
5 "4 " "0 " "as "
6 "5 " "1 " "f "
7 "5 " "1 " "bla"
So the expected result is as follows:
Col1 Col2 Farm
1 "1 " "1 " "Req"
2 "1 " "1 " "Req"
3 "2 " "0 " "q99999"
4 "3 " "1 " "q99999"
5 "4 " "0 " "q99999"
6 "5 " "1 " "q99999"
7 "5 " "1 " "q99999"
Now the "Farm" column has two groups, "Req" and "q99999".
What is the best way to do this in R while keeping performance as fast as possible?
Answer 0 (score: 2)
A possible solution using the data.table package:
library(data.table)
as.data.table(AB)[,Farm:=ifelse(.N>1, Farm, "q99999"),.(Col1, Col2, Farm)][]
# Col1 Col2 Farm
#1: 1 1 Req
#2: 1 1 Req
#3: 2 0 q99999
#4: 3 1 q99999
#5: 4 0 q99999
#6: 5 1 q99999
#7: 5 1 q99999
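The same grouping rule can also be written as two separate steps, which may be easier to follow than the one-liner: first record the size of each (Col1, Col2, Farm) group, then overwrite Farm only where that size is below two. This is a minimal sketch, not a benchmarked solution. It rebuilds the example data directly as a data.table instead of converting the matrix, and the helper column name n is an assumption introduced here for illustration.

```r
library(data.table)

# Rebuild the example as a data.table (values copied from the question).
dt <- data.table(
  Col1 = c("1", "1", "2", "3", "4", "5", "5"),
  Col2 = c("1", "1", "0", "1", "0", "1", "1"),
  Farm = c("Req", "Req", "Req", "fd", "as", "f", "bla")
)

# Step 1: store the size of each (Col1, Col2, Farm) group in a helper column.
# "n" is a hypothetical name chosen for this sketch.
dt[, n := .N, by = .(Col1, Col2, Farm)]

# Step 2: relabel only the rows whose group has fewer than two observations.
dt[n < 2, Farm := "q99999"]

# Remove the helper column again.
dt[, n := NULL]

print(dt)
```

Both steps modify dt by reference, so no copy of the table is made along the way.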
Or in base R, using ave:
AB[,'Farm'] = ave(AB[,'Farm'], do.call(c,apply(AB,2,list)), FUN=function(x) ifelse(length(x)==1, 'q99999',x))
# Col1 Col2 Farm
#1 "1" "1" "Req"
#2 "1" "1" "Req"
#3 "2" "0" "q99999"
#4 "3" "1" "q99999"
#5 "4" "0" "q99999"
#6 "5" "1" "q99999"
#7 "5" "1" "q99999"
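A variant of the ave idea, shown here as a hedged sketch: instead of returning the replacement value from FUN, ave can be used only to count the rows in each (Col1, Col2, Farm) combination, and the relabelling is then a plain logical subset assignment. The character matrix is rebuilt with cbind so the block runs without plyr, and grp_size is a name introduced for this sketch. It has not been benchmarked on millions of rows.

```r
# Rebuild the example as a character matrix (values copied from the question).
AB <- cbind(
  Col1 = c("1", "1", "2", "3", "4", "5", "5"),
  Col2 = c("1", "1", "0", "1", "0", "1", "1"),
  Farm = c("Req", "Req", "Req", "fd", "as", "f", "bla")
)

# For every row, count how many rows share its (Col1, Col2, Farm) combination.
# "grp_size" is a hypothetical helper name used only in this sketch.
grp_size <- ave(
  seq_len(nrow(AB)),
  AB[, "Col1"], AB[, "Col2"], AB[, "Farm"],
  FUN = length
)

# Relabel the rows whose combination occurs fewer than two times.
AB[grp_size < 2, "Farm"] <- "q99999"

print(AB)
```

Keeping the count in its own vector makes it easy to inspect which combinations are rare before anything is overwritten.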