Question

My data frame looks like this:

> mydata
date  station  treatment  subject   par
A       a         0         R1      1.3    
A       a         0         R1      1.4    
A       a         1         R2      1.4   
A       a         1         R2      1.1    
A       b         0         R1      1.5    
A       b         0         R1      1.8     
A       b         1         R2      2.5     
A       b         1         R2      9.5    
B       a         0         R1      0.3    
B       a         0         R1      8.2    
B       a         1         R2      7.3    
B       a         1         R2      0.2    
B       b         0         R1      9.4    
B       b         0         R1      3.2    
B       b         1         R2      3.5    
B       b         1         R2      2.4 
....

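where:

date is a factor with 2 levels, A/B; station is a factor with 2 levels, a/b; treatment is a factor with 2 levels, 0/1;

subject is the replicate, R1 to R20, assigned to a treatment (10 to treatment 0, 10 to treatment 1);

and par is my parameter, a repeated measurement of particle size for each subject at each date and station.

What I need to do is divide par into 10 equal-width bins and count the number of values in each bin. This has to be done within subsets of mydata defined by each combination of date, station and subject. The final result must be a data frame, myres, like this: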

> myres
    date  station  treatment  bin.centre  freq
    A       a         0         1.2        4 
    A       a         0         1.3        3    
    A       a         0         1.4        2 
    A       a         0         1.5        1    
    A       a         1         1.2        4    
    A       a         1         1.3        3    
    A       a         1         1.4        2     
    A       a         1         1.5        1    
    B       b         0         2.3        5   
    B       b         0         2.4        4    
    B       b         0         2.5        3    
    B       b         0         2.6        2   
    B       b         1         2.3        5   
    B       b         1         2.4        4   
    B       b         1         2.5        3   
    B       b         1         2.6        2
    ....

This is what I have done so far:

#define the number of bins
num.bins<-10

#define the width of each bins
bin.width<-(max(par)-min(par))/num.bins

#define the lower and upper boundaries of each bins
bins<-seq(from=min(par), to=max(par), by=bin.width)

#define the centre of each bins
bin.centre<-c(seq(min(bins)+bin.width/2,max(bins)-bin.width/2,by=bin.width))

#create a vector to store the frequency in each bins

  freq<-numeric(length(bins)-1)

 # this is the loop that counts the frequency of particles between the lower and upper boundaries
 # of each bin and stores the result in freq

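 for(i in 1:num.bins){
    # count values in [bins[i], bins[i+1]); the last bin also takes max(par)
    freq[i]<-length(which(par>=bins[i] &
    (par<bins[i+1] | (i==num.bins & par==max(par)))))
     }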

 #create the data frame with the results
 res<-data.frame(bin.centre,freq)

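My first approach was to use subset() to manually subset mydata for each combination of subject, station and date, apply the above sequence of commands to each subset, and then build the final data frame by combining the individual res results. However, this procedure is very convoluted and prone to error propagation. What I would like to do is automate the above procedure so that it computes the binned frequency distribution for each subject. My intuition is that the best way is to write a function that estimates this particle distribution and then apply it to each subject with a for loop. However, I don't know how to do that. Any suggestion would be greatly appreciated.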

Thanks, 利玛.

Answer 1

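You can do this in a few steps using functions from the plyr package. It lets you split the data into the chunks you need, apply a statistic to each chunk, and combine the results.

First I set up some dummy data: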

set.seed(1)
n <- 100
dat <- data.frame(
    date=sample(LETTERS[1:2], n, replace=TRUE),
    station=sample(letters[1:2], n, replace=TRUE),
    treatment=sample(0:1, n, replace=TRUE),
    subject=paste("R", sample(1:2, n, replace=TRUE), sep=""),
    par=runif(n, 0, 5)
)
head(dat)

  date station treatment subject       par
1    A       b         0      R2 3.2943880
2    A       a         0      R1 0.9253498
3    B       a         1      R1 4.7718907
4    B       b         0      R1 4.4892425
5    A       b         0      R1 4.7184853
6    B       a         1      R2 3.6184538

Now I use the base R function cut to divide par into equal-width bins:

dat$bin <- cut(dat$par, breaks=10)
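
The question asks for a bin.centre column, whereas cut returns interval labels. When breaks is a single number, cut also widens the outer limits slightly so that the extreme values fall inside the bins, which is why the labels do not start exactly at min(par). As a rough sketch of one way to get centres instead, the hypothetical helper below (bin_centres is my own name, not a base or plyr function) builds the breaks explicitly and assumes x has at least two distinct values:

```r
# Hypothetical helper: label each value of x with the centre of its equal-width bin.
# Assumes x contains at least two distinct values (otherwise the breaks are not unique).
bin_centres <- function(x, n = 10) {
  breaks  <- seq(min(x), max(x), length.out = n + 1)   # n + 1 bin edges
  centres <- head(breaks, -1) + diff(breaks) / 2        # midpoint of each bin
  idx <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  centres[idx]
}

x <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
bin_centres(x, n = 10)
# [1] 0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
```

With include.lowest = TRUE the smallest value lands in the first bin rather than being dropped, and the largest value lands in the last bin.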

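Now for the interesting bit. Load the plyr package and use the function ddply to split, apply and combine. Because you need a frequency count, we can use the function length to count how many times each replicate appears in each bin: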

library(plyr)
res <- ddply(dat, .(date, station, treatment, bin), 
  summarise, freq=length(treatment))
head(res)

  date station treatment             bin freq
1    A       a         0 (0.00422,0.501]    1
2    A       a         0   (0.501,0.998]    2
3    A       a         0      (1.5,1.99]    4
4    A       a         0     (1.99,2.49]    2
5    A       a         0     (2.49,2.99]    2
6    A       a         0     (2.99,3.48]    1
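
Note that the code above bins par over the whole data set and groups by treatment rather than subject. If the bins should instead be recomputed inside each date/station/subject subset, as the loop in the question seems to intend, a base R sketch along the following lines may help. The helper name bin_freq and the made-up mydata are assumptions for illustration only, and each subset is assumed to contain at least two distinct par values:

```r
# Hypothetical helper: frequency table of x over n equal-width bins.
# Assumes x contains at least two distinct values.
bin_freq <- function(x, n = 10) {
  breaks <- seq(min(x), max(x), length.out = n + 1)
  bins <- cut(x, breaks = breaks, include.lowest = TRUE)
  data.frame(
    bin.centre = head(breaks, -1) + diff(breaks) / 2,
    freq = as.vector(table(bins))   # empty bins are kept with freq 0
  )
}

# Made-up data in the shape of mydata (illustrative values only)
set.seed(1)
mydata <- data.frame(
  date      = rep(c("A", "B"), each = 40),
  station   = rep(rep(c("a", "b"), each = 20), times = 2),
  treatment = rep(rep(0:1, each = 10), times = 4),
  subject   = rep(rep(c("R1", "R2"), each = 10), times = 4),
  par       = runif(80, 0, 10)
)

# Split into subsets, bin each subset separately, then stack the results
keys <- c("date", "station", "treatment", "subject")
pieces <- split(mydata, mydata[keys], drop = TRUE)
myres <- do.call(rbind, lapply(pieces, function(d) {
  data.frame(
    date = d$date[1], station = d$station[1],
    treatment = d$treatment[1], subject = d$subject[1],
    bin_freq(d$par)
  )
}))
rownames(myres) <- NULL
head(myres)
```

Each subset contributes exactly 10 rows, one per bin, so the frequencies within a subset add up to the number of measurements in that subset.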
