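This is a follow-up to a post I made previously, here.
I am working in R.
In short, my vectors are huge (13 GB), but they shouldn't be. The original CSV file is only a fraction of that size. As you can imagine, 13 GB is more memory than my machine has, let alone what is allocated to R.
The code I am currently using is: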
library(sm) # provides sm.ancova
data1 <- read.csv("stackexample.csv") # read in dummy data
data1C <- data1[, 3:13] # cut off the ends
SvDvDis <- data1C[c(-3, -4, -6, -7, -9, -10, -11)] # drop individual columns
attach(SvDvDis) # attach for simplicity's sake
sm.ancova(s, dt, dip, model = "none") # non-parametric ANCOVA
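The dummy data file can be found on my Dropbox.
Is there a way to reduce the memory this function uses, or is there an alternative function or piece of code that performs the same analysis (non-parametric ANCOVA) in a less memory-intensive way? To be clear, I am not asking about the statistics. I am asking how to do this more efficiently.
Answer 0 (score: 0)
Here is my suggestion, which runs fine on my humble laptop. You can supplement it with tests of the means to make sure the sample adequately reflects the population.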
data1 <- read.csv("stackexample.csv") ##read in dummy data
library(dplyr)
library(sm)
data2 <- sample_n(data1, 10000) # make statistics work for you -- sample the data
sm.ancova(x = data2$s,
y = data2$dt,
group = data2$dip,
model = "none") #non-parametric ANCOVA
Even with a sample of only 1,000, I found no significant difference in the means.
t.test(data1$s, data2$s)
	Welch Two Sample t-test

data:  data1$s and data2$s
t = -1.4469, df = 1017.9, p-value = 0.1482
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -37.657822   5.692622
sample estimates:
mean of x mean of y
 125.3137  141.2963
With a sample of 5,000:
data2 <- sample_n(data1, 5000) # make statistics work for you -- sample the data
t.test(data1$s, data2$s)
	Welch Two Sample t-test

data:  data1$s and data2$s
t = -1.0653, df = 5513.7, p-value = 0.2868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.736700   4.359704
sample estimates:
mean of x mean of y
 125.3137  130.5022
t.test(data1$dt, data2$dt)
	Welch Two Sample t-test

data:  data1$dt and data2$dt
t = -0.069479, df = 5507.8, p-value = 0.9446
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.39645  17.13709
sample estimates:
mean of x mean of y
 515.6206  516.2503
t.test(data1$dip, data2$dip)
	Welch Two Sample t-test

data:  data1$dip and data2$dip
t = 1.2044, df = 5536.3, p-value = 0.2285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6268062  2.6241395
sample estimates:
mean of x mean of y
 126.6667  125.6680
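The per-column checks above can also be run in one pass. The following is only a sketch: it builds a synthetic stand-in for data1 (the means and standard deviations are made up, not taken from stackexample.csv) so that it runs on its own, and it uses base-R row sampling in place of sample_n.

```r
# Sketch: compare the mean of each analysis column in the full data
# against the sample. The data frame below is a made-up stand-in;
# in the post, data1 comes from read.csv("stackexample.csv").
set.seed(42)  # assumption: fixed seed so the run is reproducible
data1 <- data.frame(s   = rnorm(20000, mean = 125, sd = 300),
                    dt  = rnorm(20000, mean = 515, sd = 650),
                    dip = rnorm(20000, mean = 126, sd = 60))

# Base-R equivalent of sample_n(data1, 5000)
data2 <- data1[sample(nrow(data1), 5000), ]

# One Welch t-test per column; keep only the p-values
cols <- c("s", "dt", "dip")
p_values <- sapply(cols, function(col) {
  t.test(data1[[col]], data2[[col]])$p.value
})
print(round(p_values, 4))
```

With the real file, drop the synthetic data1 and keep the rest. A small p-value for any column would suggest drawing a larger or a fresh sample.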
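Of course, you can use more or different statistics to validate your sample, depending on how far you want to go. You could also estimate a power curve beforehand to determine the sample size.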
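As a rough sketch of the power-curve idea, power.t.test from the stats package can tabulate power against sample size. The values of delta and sigma below are placeholder assumptions, not estimates from the data, and power.t.test assumes two independent groups of equal size, so this only approximates the full-data-versus-sample comparison used here.

```r
# Sketch: how power changes with sample size for a two-sample t-test.
# delta and sigma are hypothetical placeholders, not values from the data.
delta <- 10    # hypothetical difference in means worth detecting
sigma <- 300   # hypothetical standard deviation of the variable

# Power at a few candidate sample sizes (n is per group)
sizes <- c(500, 1000, 5000, 10000)
pw <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = delta, sd = sigma, sig.level = 0.05)$power
})
print(data.frame(n = sizes, power = round(pw, 3)))

# Solve for the n per group that reaches 80% power under the same assumptions
needed <- power.t.test(delta = delta, sd = sigma, sig.level = 0.05, power = 0.8)
print(ceiling(needed$n))
```

Replacing sigma with sd() of the column of interest, and delta with the smallest difference you care about, gives a sample size tied to your own data.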
With a sample of 10,000, it took about 3 minutes on my laptop. A sample of 1,000 finished instantly.