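Data such as age is often given as ranges, and I want to compute the mean (midpoint) of each range. I can compute it, but I suspect there is a more elegant and perhaps faster way.
Here is a working example: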
age <- c("0-10", "11-20", "21-30", "31-40") # define the age vector in ranges
age_split<-strsplit(age,"-") # gives the list with splits
for(ii in 1:length(age)){
age[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
print(age)
## [1] "5" "15.5" "25.5" "35.5"
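Note that the results above are printed as strings: the loop writes the numeric means back into `age`, which is a character vector, so each value is coerced to character. A minimal sketch of the same loop that keeps the result numeric, assuming a separate preallocated vector (the name `age_mid` is hypothetical and not part of the original code):

```r
age <- c("0-10", "11-20", "21-30", "31-40")  # age ranges as strings
age_split <- strsplit(age, "-")              # list of character pairs

# Preallocate a numeric vector so the means are not coerced to character
age_mid <- numeric(length(age))
for (ii in seq_along(age)) {
  # [[ii]] extracts the character vector itself, so unlist() is not needed
  age_mid[ii] <- mean(as.numeric(age_split[[ii]]))
}
print(age_mid)
```

Using `seq_along(age)` instead of `1:length(age)` also avoids iterating over `1:0` when the input is empty.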
Following the suggestions from lmo and akron, here is code that benchmarks the various methods:
irows = 100000
data1 <- paste0(sample(1:10, irows, replace = TRUE),"-", sample(11:20, irows, replace = TRUE))
data2 <- data1; data3 <- data1; data4 <- data1 # replicated for testing different methods
#--method 1 -- originally proposed
a1<-Sys.time()
age_split<-strsplit(data1,"-")
for(ii in 1:length(data1)){
data1[ii] <- mean(as.numeric(unlist(age_split[ii])))
}
Sys.time()-a1
# method 2 (lmo suggestion)
a2<-Sys.time()
data2 <- sapply(strsplit(data2, split="-"), function(i) mean(as.numeric(i)))
Sys.time()-a2
# method 3 (cue from akron)
a3<-Sys.time()
age_split_matrix <-do.call(rbind, strsplit(data3,"-"))
class(age_split_matrix) <- "numeric"
data3<-rowMeans(age_split_matrix)
Sys.time()-a3
# method 4 (akron proposed)
a4<-Sys.time()
data4 <-rowMeans(read.table(text=data4, sep = "-"))
Sys.time()-a4
# validating if outputs match
all.equal(as.numeric(data1),data2)
all.equal(as.numeric(data1),data3)
all.equal(as.numeric(data1),data4)
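With irows = 100K, the times for methods 1 to 4 were: (1) 2.5 s, (2) 1.4 s, (3) 0.34 s, (4) 6.3 s. With irows = 1 million: (1) 23 s, (2) 14 s, (3) 6 s, (4) very long. With irows = 10 million: (1) 3.9 min, (2) 2.9 min, (3) very long. This leads me to conclude that read.table is really slow for this task, and that method 3 uses a lot of memory.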
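For reference, the timings can also be collected with base R's `system.time()` rather than by subtracting `Sys.time()` values. The sketch below is only an illustration, assuming a smaller input (`n = 10000`) so it runs quickly; the variable names (`x`, `t_sapply`, `t_matrix`, and so on) are hypothetical, and the numbers it prints will differ from those reported above.

```r
set.seed(1)  # make the sample reproducible
n <- 10000   # assumed smaller size than in the benchmark above
x <- paste0(sample(1:10, n, replace = TRUE), "-",
            sample(11:20, n, replace = TRUE))

# Method 2 style: split, then average each pair with sapply
t_sapply <- system.time(
  m_sapply <- sapply(strsplit(x, "-"), function(i) mean(as.numeric(i)))
)[["elapsed"]]

# Method 3 style: bind the pairs into a matrix, convert, then rowMeans
t_matrix <- system.time({
  parts <- do.call(rbind, strsplit(x, "-"))
  mode(parts) <- "numeric"
  m_matrix <- rowMeans(parts)
})[["elapsed"]]

cat("sapply:", t_sapply, "s; matrix:", t_matrix, "s\n")
```

A single run like this is noisy, so repeating each measurement several times gives a more reliable comparison.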
Answer 0 (score: 1)
We can use rowMeans after reading the ranges into a data.frame with read.table:
rowMeans(read.table(text=age, sep="-"))
#[1] 5.0 15.5 25.5 35.5
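To make the intermediate step visible, here is a small sketch of what `read.table` produces before `rowMeans` is applied. The column names `lower` and `upper` and the variable names are assumptions added for readability; the answer itself uses the default column names.

```r
age <- c("0-10", "11-20", "21-30", "31-40")

# Parse each "lower-upper" string into two numeric columns
bounds <- read.table(text = age, sep = "-",
                     col.names = c("lower", "upper"))
str(bounds)  # a data.frame with 4 rows and 2 integer columns

# Average across the two columns of each row; unname() drops any row labels
midpoints <- unname(rowMeans(bounds))
print(midpoints)
```

This relies on every element containing exactly one "-" separator; a range with a negative bound, for example, would not parse into two clean columns this way.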
Answer 1 (score: 0)
Here is a one-liner with sapply:
sapply(strsplit(age, split="-"), function(i) mean(as.numeric(i)))
#[1] 5.0 15.5 25.5 35.5
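strsplit splits each string on "-" and returns a list, which is passed to sapply; sapply then takes each list item, converts the character vector to numeric, and computes its mean.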
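The two steps described above can be sketched separately. This version swaps in `vapply` for `sapply`, which is not what the answer uses; it is shown only because it declares the expected return type. The names `pieces` and `midpoints` are hypothetical.

```r
age <- c("0-10", "11-20", "21-30", "31-40")

# Step 1: split each string on a literal "-"; the result is a list
pieces <- strsplit(age, split = "-", fixed = TRUE)
print(pieces[[1]])  # the two character bounds of the first range

# Step 2: convert each pair to numeric and take its mean;
# numeric(1) tells vapply that each call must return a single number
midpoints <- vapply(pieces, function(p) mean(as.numeric(p)), numeric(1))
print(midpoints)
```

With `vapply`, a malformed element that produced something other than a single number would raise an error instead of silently changing the shape of the result.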