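I am trying to sample a dataset while keeping the same frequency distribution as one of its columns. The approach is: 1) determine the baseline frequency distribution, 2) sample a range of rows using the baseline frequency distribution, 3) use the sampled row range to select rows from the baseline data frame, 4) compare the baseline and downsampled frequency distributions. Here are two examples. In both, the event with the highest probability is oversampled and the remaining events are undersampled.
Function to resample the data, keeping the frequency of one column the same: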
sampFreq<-function(df,col,ns) {
x<-as.factor(df[,col])
freq_x<-table(x)
prob_x<-freq_x/sum(freq_x)
df_prob = prob_x[as.factor(df[,col])]
nr=nrow(df)
samp_rows = sample(1:nr,ns,replace=FALSE,prob=df_prob)
return(df[samp_rows,])
}
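Steps: 1) specify the target frequency distribution, 2) convert it to probabilities, 3) generate data with the target frequency distribution, 4) use the function above to downsample the data.
cfreq_1=c(1,2,3,4,5,4,3,2,1)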
freq_1 = matrix(cfreq_1, nrow = 1, ncol = length(cfreq_1), byrow = TRUE,
dimnames = list(c("row1" ),
c(as.character(4+(1:length(cfreq_1))))))
pr_1=freq_1/sum(freq_1)
set.seed(31)
ns=5000
df_1a<-data.frame(nbr = sample(4+(1:length(pr_1)),ns,
replace=TRUE,prob=pr_1),
ord=1:ns)
df_1b<-sampFreq(df_1a, "nbr", 1000)
5) Get the frequencies of the simulated and downsampled data, 6) sort the frequencies by the numeric value of the dimension names.
tb_1a<-table(df_1a$nbr)
tb_1b<-table(df_1b$nbr)
s_tb_1a<-tb_1a[order(as.numeric(attr(tb_1a,"dimnames")[[1]]))]
s_tb_1b<-tb_1b[order(as.numeric(attr(tb_1b,"dimnames")[[1]]))]
7) Plot the prescribed probabilities together with the probabilities from the data and from the downsample.
plot(as.numeric(attr(pr_1,"dimnames")[[2]]),pr_1,log="y",ylim=c(.01,.3),
cex=1.5,pch=15,col="black",type="o", lty=2,
xlab='event',ylab='Probability',main="Example 1, Oversample high prob, undersample low")
points(as.numeric(attr(tb_1a,"dimnames")[[1]]),s_tb_1a/sum(s_tb_1a),
cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_1b,"dimnames")[[1]]),s_tb_1b/sum(s_tb_1b),
cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
col=c("black","blue","red"),lty=c(2,2,1))
grid()
Note that the event with the highest probability is oversampled, while the other events are undersampled (red curve).
txt = "0.028506949 0.059389476 0.285069486 0.282693907 0.242309063 2.974224967
0.064140634 0.002375579 0.019004632 0.280318328 0.033258107 0.073642950
0.007126737 0.007126737 39.045017223 2.261551253 0.052262739 0.045136002
0.014253474 0.035633686 5.223898325 1.073761729 4.150136596 0.009502316
5.038603160 1.021498990 4.017104169 0.002375579 0.073642950 1.197291840
0.501247179 0.052262739 0.776814348 0.071267371 8.416676565 0.026131370
0.019004632 0.002375579 0.168666112 0.023755790 5.718018767 0.501247179
0.014253474 0.776814348 0.071267371 8.416676565 0.026131370 0.002375579
0.002375579 0.168666112 0.023755790 5.718018767 0.194797482 0.028506949
0.137783585 0.016629053 0.002375579 0.494120442 0.007126737 "
# Here is the target frequency distribution
cfreq_2=scan(text=txt,multi.line =TRUE)
freq_2 = matrix(cfreq_2, nrow = 1, ncol = length(cfreq_2), byrow = TRUE,
dimnames = list(c("row1" ),
c(as.character(4+(1:length(cfreq_2))))))
# Convert to probability
pr_2=freq_2/sum(freq_2)
# Generate some data
ns=42095
df_2a<-data.frame(nbr = sample(4+(1:length(pr_2)),ns,
replace=TRUE,prob=pr_2),
ord=1:ns)
df_2b<-sampFreq(df_2a, "nbr", 10000)
tb_2a<-table(df_2a$nbr)
tb_2b<-table(df_2b$nbr)
s_tb_2a<-tb_2a[order(as.numeric(attr(tb_2a,"dimnames")[[1]]))]
s_tb_2b<-tb_2b[order(as.numeric(attr(tb_2b,"dimnames")[[1]]))]
plot(as.numeric(attr(pr_2,"dimnames")[[2]]),pr_2,log="y",ylim=c(.00001,.7),
cex=1.5,pch=15,col="black",type="o", lty=2,
xlab='event',ylab='Probability',main="Example 2, Oversampled Point With High Prob, Undersampled Others")
points(as.numeric(attr(tb_2a,"dimnames")[[1]]),s_tb_2a/sum(s_tb_2a),
cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_2b,"dimnames")[[1]]),s_tb_2b/sum(s_tb_2b),
cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
col=c("black","blue","red"),lty=c(2,2,1))
grid()
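Here, only one event is oversampled and all the others are undersampled.
The question is why the red line does not come close to the other lines, and why the error appears to be systematic.
Matching the frequencies of the infrequent elements (small frequencies) is what matters; matching the frequencies of the frequently occurring elements (large frequencies) is less important.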
Answer 0 (score: 1)
The following function gives the desired result.
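sampFreq <- function(cdf, col, ns) {
  # Levels (distinct values) of the column whose distribution is preserved
  sLevels <- levels(as.factor(cdf[, col]))
  nLevels <- length(sLevels)
  # Fraction of rows to keep; the same fraction is applied to every level
  rat <- ns / nrow(cdf)
  rdata <- NULL
  for (is in seq_len(nLevels)) {
    # Rows belonging to the current level
    ldata <- cdf[cdf[, col] == sLevels[is], ]
    ndata <- nrow(ldata)
    # Per-level sample size, truncated to a whole number, at least one row
    nsdata <- max(floor(ndata * rat), 1)
    # Sample with replacement only when upsampling (ns > nrow(cdf))
    srows <- sample(seq_len(ndata), nsdata, replace = rat > 1)
    sdata <- ldata[srows, ]
    rdata <- rbind(rdata, sdata)
  }
  return(rdata)
}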
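One way to see why the weighted version in the question drifts, offered as a sketch rather than a full analysis: that version gives every row a weight equal to its class share, so a class is favoured twice, once by having more rows and once by each of those rows carrying a larger weight. The symbols N, n_k and p_k below are notation introduced here, not taken from the post, and the result is exact only for the first draw, since sampling without replacement shifts the weights slightly as rows are used up.

```latex
% N rows in total, n_k rows in class k, p_k = n_k / N (class share).
% The question's sampFreq gives every row in class k the weight p_k.
\[
P(\text{draw lands in class } k)
  = \frac{n_k \, p_k}{\sum_j n_j \, p_j}
  = \frac{p_k^2}{\sum_j p_j^2}
\]
% Example 1, prescribed shares (1,2,3,4,5,4,3,2,1)/25, so \sum_j p_j^2 = 85/625:
\[
p_{\max} = \tfrac{5}{25} = 0.20 \;\to\; \tfrac{25}{85} \approx 0.294,
\qquad
p_{\min} = \tfrac{1}{25} = 0.04 \;\to\; \tfrac{1}{85} \approx 0.012
\]
```

Squaring the shares inflates the largest class and shrinks the small ones, which is the pattern the red curves show. Taking the same fraction from every level, as the function above does, leaves each share at roughly p_k.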
[Figures: Example 1, Example 2]
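The comparison can also be checked numerically instead of by eye. The sketch below is self-contained and reuses the Example 1 setup; `strat_sample` and `shares` are hypothetical helpers written for this illustration (they are not from the post), and `strat_sample` rounds the per-level sample size, which is an assumption of this sketch.

```r
# Hypothetical helper: take the same fraction of rows from every level of df[[col]].
strat_sample <- function(df, col, ns) {
  frac <- ns / nrow(df)
  # Row indices grouped by level; drop = TRUE skips levels with no rows
  idx_by_level <- split(seq_len(nrow(df)), df[[col]], drop = TRUE)
  picked <- lapply(idx_by_level, function(idx) {
    k <- max(round(length(idx) * frac), 1)
    idx[sample.int(length(idx), k, replace = frac > 1)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Hypothetical helper: class shares of x over a fixed set of levels.
shares <- function(x, lev) as.numeric(table(factor(x, levels = lev))) / length(x)

set.seed(31)
pr <- c(1, 2, 3, 4, 5, 4, 3, 2, 1) / 25
df <- data.frame(nbr = sample(5:13, 5000, replace = TRUE, prob = pr),
                 ord = 1:5000)
lev <- sort(unique(df$nbr))

# Weighted draw as in the question: each row weighted by its class share.
w <- shares(df$nbr, lev)[match(df$nbr, lev)]
weighted <- df[sample.int(nrow(df), 1000, replace = FALSE, prob = w), ]

# Per-level (stratified) draw.
strat <- strat_sample(df, "nbr", 1000)

p_data     <- shares(df$nbr, lev)
p_weighted <- shares(weighted$nbr, lev)
p_strat    <- shares(strat$nbr, lev)

# Largest absolute deviation from the data's class shares.
err_weighted <- max(abs(p_weighted - p_data))
err_strat    <- max(abs(p_strat - p_data))

out <- rbind(data = p_data, weighted = p_weighted, stratified = p_strat)
colnames(out) <- lev
print(round(out, 3))
```

Because each level's size is rounded separately, the stratified draw can return a few rows more or fewer than `ns`.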