How to downsample data while preserving the frequency distribution of one column

Time: 2014-10-20 23:53:11

Tags: r probability frequency downsampling

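I am trying to sample a data set while keeping the same frequency distribution as one of its columns. The approach is to 1) determine the baseline frequency distribution, 2) sample row indices using the baseline frequency distribution, 3) use the sampled row indices to select rows from the baseline data frame, and 4) compare the baseline and downsampled frequency distributions. Here are two examples. In both examples, the events with the highest probability are oversampled and the remaining events are undersampled.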

Function to resample the data, keeping the frequency of one column the same

sampFreq<-function(df,col,ns) {
  x<-as.factor(df[,col])
  freq_x<-table(x)
  prob_x<-freq_x/sum(freq_x)
  df_prob = prob_x[as.factor(df[,col])]
  nr=nrow(df)
  samp_rows = sample(1:nr,ns,replace=FALSE,prob=df_prob)
  return(df[samp_rows,])
}
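
A note on what the `prob` argument does in this function may help. Every row receives a weight equal to the relative frequency of its own level, so a level with share p contributes (number of rows) times (weight per row), which is proportional to p squared. When drawing with replacement, the expected share of a level is therefore proportional to p^2 rather than p; without replacement the effect should be similar as long as the sample is small relative to the data. The sketch below computes this for frequencies shaped like those in Example 1; the variable names are illustrative and not from the original post.

```r
# Target shares for nine levels (same shape as Example 1's frequencies).
p <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
p <- p / sum(p)

# If each row is weighted by its level's share, a level's total weight is
# (number of rows) * (weight per row), i.e. proportional to p * p.
share_weighted <- p^2 / sum(p^2)

# Compare the target shares with the shares implied by the weighting.
print(round(rbind(target = p, weighted = share_weighted), 3))
```

Under this reading, the most frequent level ends up with a larger share than intended and the rare levels with a smaller one, which is the pattern described for both examples.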

Example 1

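Steps: 1) specify the target frequency distribution, 2) convert it to probabilities, 3) generate data with the target frequency distribution, 4) downsample the data using the function above.
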
cfreq_1=c(1,2,3,4,5,4,3,2,1)
freq_1 = matrix(cfreq_1, nrow = 1, ncol = length(cfreq_1), byrow = TRUE,
               dimnames = list(c("row1" ),
                               c(as.character(4+(1:length(cfreq_1))))))
pr_1=freq_1/sum(freq_1)
set.seed(31)
ns=5000
df_1a<-data.frame(nbr = sample(4+(1:length(pr_1)),ns,
                               replace=TRUE,prob=pr_1),
                  ord=1:ns)
df_1b<-sampFreq(df_1a, "nbr", 1000)

5) Get the frequencies of the simulated and downsampled data; 6) sort the frequencies by the numeric value of the dimension names.

tb_1a<-table(df_1a$nbr)
tb_1b<-table(df_1b$nbr)
s_tb_1a<-tb_1a[order(as.numeric(attr(tb_1a,"dimnames")[[1]]))]
s_tb_1b<-tb_1b[order(as.numeric(attr(tb_1b,"dimnames")[[1]]))]
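
Steps 5 and 6 can also be written more compactly. For a numeric column, `table()` already orders its categories by numeric value (the unique values are sorted before being turned into labels), and `prop.table()` converts counts to relative frequencies. A small sketch with made-up values follows; `x`, `tb`, and `pr` are illustrative names, not from the original data.

```r
# Made-up numeric values, deliberately unordered.
x <- c(10, 9, 5, 13, 9, 10, 10)

tb <- table(x)          # categories come out in numeric order
pr <- prop.table(tb)    # counts divided by their total

print(names(tb))
print(round(pr, 3))
```

The explicit reordering is still needed when the column is stored as character, since labels such as "10" and "5" would then sort alphabetically.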

7) Plot the prescribed probabilities, along with the probabilities from the data and from the downsample.

plot(as.numeric(attr(pr_1,"dimnames")[[2]]),pr_1,log="y",ylim=c(.01,.3),
     cex=1.5,pch=15,col="black",type="o", lty=2, 
     xlab='event',ylab='Probability',main="Example 1, Oversample high prob, undersample low")
points(as.numeric(attr(tb_1a,"dimnames")[[1]]),s_tb_1a/sum(s_tb_1a),
       cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_1b,"dimnames")[[1]]),s_tb_1b/sum(s_tb_1b),
       cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
       col=c("black","blue","red"),lty=c(2,2,1))
grid()

Note that the event with the highest probability is oversampled, while the other events are undersampled (red curve).


Example 2

txt = "0.028506949  0.059389476  0.285069486  0.282693907  0.242309063  2.974224967
 0.064140634  0.002375579  0.019004632  0.280318328  0.033258107  0.073642950
  0.007126737  0.007126737 39.045017223  2.261551253  0.052262739  0.045136002
  0.014253474  0.035633686  5.223898325  1.073761729  4.150136596  0.009502316
  5.038603160  1.021498990  4.017104169  0.002375579  0.073642950  1.197291840
  0.501247179  0.052262739  0.776814348  0.071267371  8.416676565  0.026131370
  0.019004632  0.002375579  0.168666112  0.023755790  5.718018767  0.501247179
  0.014253474  0.776814348  0.071267371  8.416676565  0.026131370  0.002375579
  0.002375579  0.168666112  0.023755790  5.718018767  0.194797482  0.028506949
  0.137783585  0.016629053  0.002375579  0.494120442  0.007126737  "

# Here is the target frequency distribution
cfreq_2=scan(text=txt,multi.line =TRUE)
freq_2 = matrix(cfreq_2, nrow = 1, ncol = length(cfreq_2), byrow = TRUE,
              dimnames = list(c("row1" ),
                              c(as.character(4+(1:length(cfreq_2))))))
# Convert to probability
pr_2=freq_2/sum(freq_2)

# Generate some data
ns=42095
df_2a<-data.frame(nbr = sample(4+(1:length(pr_2)),ns,
                               replace=TRUE,prob=pr_2),
                  ord=1:ns)
df_2b<-sampFreq(df_2a, "nbr", 10000)

tb_2a<-table(df_2a$nbr)
tb_2b<-table(df_2b$nbr)
s_tb_2a<-tb_2a[order(as.numeric(attr(tb_2a,"dimnames")[[1]]))]
s_tb_2b<-tb_2b[order(as.numeric(attr(tb_2b,"dimnames")[[1]]))]
plot(as.numeric(attr(pr_2,"dimnames")[[2]]),pr_2,log="y",ylim=c(.00001,.7),
     cex=1.5,pch=15,col="black",type="o", lty=2, 
     xlab='event',ylab='Probability',main="Example 2, Oversampled Point With High Prob, Undersampled Others")
points(as.numeric(attr(tb_2a,"dimnames")[[1]]),s_tb_2a/sum(s_tb_2a),
       cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_2b,"dimnames")[[1]]),s_tb_2b/sum(s_tb_2b),
       cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
       col=c("black","blue","red"),lty=c(2,2,1))
grid()

Here, only one event is oversampled, while the remaining events are undersampled.


The question is why the red line does not come close to the other lines; there also appears to be a systematic error.

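Matching the frequencies of the infrequent elements (small frequencies) is important; matching the frequencies of the frequently occurring elements (large frequencies) is less important.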
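
For reference, drawing rows uniformly (with no `prob` argument) preserves the column's distribution in expectation, because each level is then drawn in proportion to its row count. A minimal sketch under that assumption follows; `df_demo`, `df_unif`, and the other names are hypothetical and not part of the original code.

```r
set.seed(1)
# Hypothetical demo data: 5000 rows with nine levels of unequal frequency.
w <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
df_demo <- data.frame(nbr = sample(5:13, 5000, replace = TRUE, prob = w),
                      ord = 1:5000)

# Uniform row sample: every row has the same chance of being selected.
df_unif <- df_demo[sample(nrow(df_demo), 1000), ]

# Compare relative frequencies; factor() keeps all levels in both tables.
lv <- sort(unique(df_demo$nbr))
p_full <- prop.table(table(factor(df_demo$nbr, levels = lv)))
p_unif <- prop.table(table(factor(df_unif$nbr, levels = lv)))
print(round(rbind(full = p_full, uniform = p_unif), 3))
```

With this approach very rare levels can still be missed by chance, which matters here because the infrequent elements are the ones that need to be matched.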

1 answer:

Answer 0 (score: 1)

The following function gives the desired result.

sampFreq<-function(cdf,col,ns) {
  # Levels of the column whose frequency distribution should be preserved
  sLevels = levels(as.factor(cdf[,col]))
  nLevels = length(sLevels)
  # Downsampling ratio applied to every level
  nr=nrow(cdf)
  rat = ns/nr
  rdata = NULL
  for (is in seq_len(nLevels)) {
    # Rows belonging to the current level
    ldata <- cdf[cdf[,col]==sLevels[is],]
    ndata <- nrow(ldata)
    # Number of rows to keep for this level (at least one)
    nsdata = max(floor(ndata*rat),1)
    # Sample with replacement only when upsampling (rat > 1)
    srows <- sample(seq_len(ndata),nsdata,replace=rat>1)
    sdata <- ldata[srows,]
    rdata <- rbind(rdata,sdata)
  }
  return(rdata)
}
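
The per-level loop above is a form of stratified sampling: each level is downsampled by the same ratio `ns/nr`, so the proportions are preserved up to rounding and every level keeps at least one row. The same idea can be sketched more compactly with `split()` and `lapply()`. In the sketch below, `stratSample` and `df_demo` are hypothetical names introduced for illustration, and rounding the per-level count (rather than truncating it) is an assumption of this sketch, not the answer's exact behaviour.

```r
# Hypothetical helper: downsample each level of `col` by the ratio ns / nrow(df).
stratSample <- function(df, col, ns) {
  rat <- ns / nrow(df)
  # Row indices grouped by the level of the chosen column
  idx_by_level <- split(seq_len(nrow(df)), df[[col]])
  picked <- lapply(idx_by_level, function(idx) {
    k <- max(round(length(idx) * rat), 1)   # keep at least one row per level
    idx[sample.int(length(idx), k, replace = k > length(idx))]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)
# Hypothetical demo data: 5000 rows with nine levels of unequal frequency.
w <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
df_demo <- data.frame(nbr = sample(5:13, 5000, replace = TRUE, prob = w),
                      ord = 1:5000)
df_strat <- stratSample(df_demo, "nbr", 1000)

# Largest absolute difference between full and downsampled proportions
p_full  <- prop.table(table(df_demo$nbr))
p_strat <- prop.table(table(df_strat$nbr))
print(round(max(abs(p_full - p_strat)), 4))
```

Because counts are rounded per level, the number of rows returned may differ slightly from `ns`.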

Result plots for Example 1 and Example 2 (images not reproduced).