Question

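This is a follow-up to a post I made previously, here.

I am working in R.

In short, my vectors are huge (13 GB), but they shouldn't be. The original csv file is only a fraction of that size. As you can imagine, 13 GB is more memory than my machine has, let alone what is allocated to R.

The code I am currently using is: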

library(sm) # provides sm.ancova
data1 <- read.csv("stackexample.csv") # read in dummy data
data1C <- data1[, 3:13] # cut off the ends
SvDvDis <- data1C[c(-3, -4, -6, -7, -9, -10, -11)] # drop individual columns
attach(SvDvDis) # attach for simplicity's sake
sm.ancova(s, dt, dip, model = "none") # non-parametric ANCOVA

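The dummy data file can be found on my dropbox.

Is there a way to reduce the memory this function uses, or is there an alternative function or way of coding this that performs the same analysis (non-parametric ANCOVA) in a less memory-intensive way? To be clear, I am not asking about the statistics. I am asking how to do this more efficiently.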

Answer 1

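Here is my suggestion, which runs fine on my humble laptop. You can supplement it with t-tests on the means to check that the sample adequately reflects the population.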

data1   <- read.csv("stackexample.csv") ##read in dummy data

library(dplyr)
library(sm)

data2 <- sample_n(data1, 10000) # make statistics work for you -- sample the data
sm.ancova(x     = data2$s,
          y     = data2$dt,
          group = data2$dip,
          model = "none") #non-parametric ANCOVA
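If you want the draw to be reproducible, or would rather not load dplyr just for sampling, here is a minimal base R sketch. The data frame below is synthetic (the real file is not reproduced here), and the row count, seed, and distributions are my own assumptions purely for illustration. Note also that recent dplyr versions mark `sample_n()` as superseded by `slice_sample()`.

```r
# Synthetic stand-in for data1; all values here are made up for illustration.
set.seed(42)                      # arbitrary seed, only for reproducibility
n_rows <- 40000                   # hypothetical row count
data1 <- data.frame(
  s   = rexp(n_rows, rate = 1 / 125),
  dt  = rexp(n_rows, rate = 1 / 515),
  dip = sample(c(90, 120, 150, 180), n_rows, replace = TRUE)
)

# Draw 10,000 distinct row indices, then keep only the columns the model uses.
idx   <- sample(nrow(data1), 10000)
data2 <- data1[idx, c("s", "dt", "dip")]

dim(data2)
```

Keeping `idx` around also lets you refer to the rows that were not sampled later on, for example when validating the sample.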

Even with a sample of only 1,000, I did not find any significant difference in the means.

t.test(data1$s, data2$s)

  Welch Two Sample t-test

data:  data1$s and data2$s
t = -1.4469, df = 1017.9, p-value = 0.1482
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -37.657822   5.692622
sample estimates:
mean of x mean of y 
 125.3137  141.2963

With a sample of 5,000:

data2 <- sample_n(data1, 5000) # make statistics work for you -- sample the data
t.test(data1$s, data2$s)

  Welch Two Sample t-test

data:  data1$s and data2$s
t = -1.0653, df = 5513.7, p-value = 0.2868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.736700   4.359704
sample estimates:
mean of x mean of y 
 125.3137  130.5022

t.test(data1$dt, data2$dt)

  Welch Two Sample t-test

data:  data1$dt and data2$dt
t = -0.069479, df = 5507.8, p-value = 0.9446
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.39645  17.13709
sample estimates:
mean of x mean of y 
 515.6206  516.2503

t.test(data1$dip, data2$dip)

  Welch Two Sample t-test

data:  data1$dip and data2$dip
t = 1.2044, df = 5536.3, p-value = 0.2285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6268062  2.6241395
sample estimates:
mean of x mean of y 
 126.6667  125.6680

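Of course, you can use more or different statistics to validate your sample, depending on how far you want to go. You could also estimate a power curve beforehand to determine the sample size.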
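As a rough sketch of what that could look like, the snippet below uses two functions from base R's stats package: `ks.test()` to compare whole distributions rather than just means, and `power.t.test()` to trace a small power curve over candidate sample sizes. The data are synthetic, and the effect size `delta = 5` is an assumption I picked for illustration; substitute whatever difference matters for your analysis.

```r
# Synthetic stand-in for one column of the data; values are made up.
set.seed(1)
full <- rexp(40000, rate = 1 / 125)
idx  <- sample(length(full), 5000)

# Two-sample Kolmogorov-Smirnov test: sampled rows vs. the rows left out,
# so that the two groups are disjoint.
ks <- ks.test(full[-idx], full[idx])
ks$p.value

# Power of a two-sample t-test to detect a mean shift of `delta`
# at several per-group sample sizes (a coarse power curve).
sizes    <- c(1000, 5000, 10000)
power_at <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 5, sd = sd(full), sig.level = 0.05)$power
})
data.frame(n = sizes, power = power_at)
```

A large KS p-value is consistent with the sample following the same distribution as the rest of the data, and the power table shows how much larger samples buy in sensitivity.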

With a sample of 10,000, it took about 3 minutes on my laptop. The 1,000-sample run finished immediately.
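One possible explanation for why sampling helps so much, offered as an assumption rather than something I have confirmed in the sm source: if the smoother allocates dense working matrices whose size grows with the square of the number of rows, memory blows up quickly. The back-of-the-envelope arithmetic below uses 8 bytes per double; the 40,000-row case is hypothetical, chosen only because it lands near the 13 GB mentioned in the question.

```r
# Memory for one dense n x n matrix of doubles (8 bytes each), in GB (1e9 bytes).
# This is plain arithmetic, not a measurement of what sm.ancova allocates.
dense_matrix_gb <- function(n) 8 * n^2 / 1e9

sizes <- c(1000, 5000, 10000, 40000)   # 40000 is a hypothetical full row count
data.frame(n = sizes, gb = dense_matrix_gb(sizes))
```

Under that assumption, going from 40,000 rows to 10,000 cuts a single such matrix from 12.8 GB to 0.8 GB, a 16-fold reduction for a 4-fold smaller sample.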

Is there a way to reduce the memory required for vectors in R?
