R: Create a categorical variable when probabilities are given

Time: 2015-04-01 21:04:40

Tags: r variables distribution

I have two data frames that can be reproduced with the following code:

df=data.frame(xcode=c("612","920","924","925"),
              ratio.company1=c("0.1","0.9","0.4","0"),
              ratio.company2=c("0.1","0","0.6","0.6"),  
              ratio.company3=c("0.8","0.1","0","0.4"))
df

df2=data.frame(id=c("101","101","101","101","101","101","102","102","102","102","102","103","103","104","104","104","104","104","104","104","104","105","105","105","106","106","106","106","106","106","107","107","107","107","107","107"),
       xcode=c("612","612","612","612","612","612","612","612","612","612","612","920","920","920","920","920","920","920","920","920","920","924","924","924","924","924","924","924","924","924","925","925","925","925","925","925"),
       company=c(""))
df2

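df gives the probability of assigning people to company1, company2 or company3 based on the xcode field. df2 gives me the IDs and their xcodes. The rows in df2 need to be divided among companies 1, 2 and 3 according to the ratios given for each xcode.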

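For example, of the 11 rows with xcode 612, 10 pct are assigned to company1, 10 pct to company2 and 80 pct to company3. I want to round the resulting counts to 0 decimal places. I cannot think of a way to achieve this. Can I use the runif command to do this? Please help.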
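To make the rounding concrete, the arithmetic for xcode 612 can be sketched as follows. This assumes that plain round() is the rounding rule I mean; the names n_ids, ratios and counts are only illustrative.

```r
# Assumption: round() is the intended rounding rule; names are illustrative.
n_ids  <- 11                                   # rows in df2 with xcode 612
ratios <- c(company1 = 0.1, company2 = 0.1, company3 = 0.8)

counts <- round(n_ids * ratios)                # 1.1, 1.1, 8.8 rounded to 0 decimals
counts                                         # company1 = 1, company2 = 1, company3 = 9
sum(counts) == n_ids                           # TRUE here: the counts cover all 11 rows
```

For other ratios the rounded counts need not add up to the number of rows, so a general solution would have to deal with that case.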

My resulting dataset should look like this:

df2=data.frame(id=c("101","101","101","101","101","101","102","102","102","102","102","103","103","104","104","104","104","104","104","104","104","105","105","105","106","106","106","106","106","106","107","107","107","107","107","107"),
       xcode=c("612","612","612","612","612","612","612","612","612","612","612","920","920","920","920","920","920","920","920","920","920","924","924","924","924","924","924","924","924","924","925","925","925","925","925","925"),
       company=c("company1","company2","company3","company3","company3","company3","company3","company3","company3","company3","company3",
                 "company1","company1","company1","company1","company1","company1","company1","company1","company1","company3",
                 "company1","company1","company1","company1","company2","company2","company2","company2","company2",
                 "company2","company2","company2","company2","company3","company3"))

1 answer:

Answer 0 (score: 0)

This gives one possible interpretation of your request:

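# Cumulative probabilities for the first row of df (xcode 612): 0, 0.1, 0.2, 1
# (sapply converts each column safely whether it is stored as character or factor)
breaks <- c(0, cumsum(sapply(df[1, 2:4], function(x) as.numeric(as.character(x)))))
# Each uniform draw lands in interval 1, 2 or 3 with probability 0.1, 0.1, 0.8
c('comp1','comp2','comp3')[findInterval(runif(36), breaks)]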
#-----------
 [1] "comp3" "comp3" "comp3" "comp3" "comp2" "comp3" "comp3" "comp3" "comp3"
[10] "comp3" "comp3" "comp3" "comp2" "comp3" "comp1" "comp3" "comp1" "comp3"
[19] "comp1" "comp3" "comp3" "comp3" "comp3" "comp2" "comp3" "comp3" "comp1"
[28] "comp3" "comp3" "comp3" "comp3" "comp3" "comp3" "comp2" "comp3" "comp3"
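The snippet above only uses the first row of df. A minimal sketch of extending the same independent-draw idea to every row of df2 might look like the following. It swaps findInterval/runif for sample() with its prob argument, which draws from the same distribution. The names probs, xcodes, companies and company are made up for this illustration, and xcodes is rebuilt with rep() rather than taken from df2 so the block runs on its own.

```r
# Sketch only: each row gets an independent random draw using its xcode's ratios.
df <- data.frame(xcode = c("612", "920", "924", "925"),
                 ratio.company1 = c("0.1", "0.9", "0.4", "0"),
                 ratio.company2 = c("0.1", "0", "0.6", "0.6"),
                 ratio.company3 = c("0.8", "0.1", "0", "0.4"))

# Same xcode column as df2 in the question (11, 10, 9 and 6 rows)
xcodes <- rep(c("612", "920", "924", "925"), times = c(11, 10, 9, 6))

# Numeric probability matrix, one row per xcode (works for character or factor columns)
probs <- sapply(df[, 2:4], function(x) as.numeric(as.character(x)))
rownames(probs) <- as.character(df$xcode)
companies <- c("company1", "company2", "company3")

set.seed(1)  # only so the illustration is repeatable
company <- vapply(xcodes,
                  function(xc) sample(companies, size = 1, prob = probs[xc, ]),
                  character(1), USE.NAMES = FALSE)

# Cross-tabulate to see how close the random draws came to the ratios
table(xcodes, company)
```

Because every row is drawn independently, the counts per xcode will vary from run to run and will generally not match the rounded ratios exactly.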

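My experience answering similar questions in the past is that there is often an unstated expectation that the proportions will come out at exactly 0.1, 0.1 and 0.8, and the code above does not deliver on that expectation, because each draw is independent. If you want the result to come out exactly in those proportions (or almost exactly, since 10% of 36 is not an integer), you would need to use rdirichlet rather than runif. Alternatively, you could use sample on a vector such as c(rep('comp1', 3), rep('comp2', 4), rep('comp3', 29)).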
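The sample-on-a-rep-vector idea can be applied per xcode, which is one way to get fixed counts rather than random ones. The following is a minimal sketch under stated assumptions: allocate() is a hypothetical helper, not a base R function; it rounds the counts with round() and pushes any rounding surplus or deficit onto the largest share, which is my own choice and not something the question specifies. xcodes is again rebuilt with rep() so the block is self-contained.

```r
# Sketch only: fixed (rounded) counts per xcode, shuffled within each group.
df <- data.frame(xcode = c("612", "920", "924", "925"),
                 ratio.company1 = c("0.1", "0.9", "0.4", "0"),
                 ratio.company2 = c("0.1", "0", "0.6", "0.6"),
                 ratio.company3 = c("0.8", "0.1", "0", "0.4"))
xcodes <- rep(c("612", "920", "924", "925"), times = c(11, 10, 9, 6))

probs <- sapply(df[, 2:4], function(x) as.numeric(as.character(x)))
rownames(probs) <- as.character(df$xcode)
companies <- c("company1", "company2", "company3")

# Hypothetical helper: n rows, probability vector p, returns n shuffled labels
allocate <- function(n, p, labels = companies) {
  counts <- round(n * p)
  # Assumption: any rounding mismatch is absorbed by the largest share
  i <- which.max(p)
  counts[i] <- counts[i] + (n - sum(counts))
  sample(rep(labels, times = counts))   # fixed counts, random order
}

set.seed(42)  # affects only the order within each xcode, not the counts
company <- character(length(xcodes))
for (xc in unique(xcodes)) {
  rows <- which(xcodes == xc)
  company[rows] <- allocate(length(rows), probs[xc, ])
}

table(xcodes, company)
```

With the data from the question the counts per xcode come out as 1/1/9, 9/0/1, 4/5/0 and 0/4/2, which matches the desired result shown there; only the order of the labels within each xcode is random.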