How to run expand.grid iteratively in R

Time: 2014-01-22 07:06:09

Tags: r dataframe iteration distribution probability

I am writing some R code and I am stuck.

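Background (not needed to solve the problem): I am computing a joint probability by multiplying independent marginal distributions. The marginal probability vectors are generated iteratively by ProbGenerationProcess(). At each iteration it outputs one named vector, for example: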

Iteration 1:
Color =
   Blue  Green
   0.2    0.8   

Iteration 2:
Material =
   Cotton  Silk
    0.7     0.3

Iteration 3:
Country =
   China     USA
    0.6      0.4

......

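Desired result: the joint probability I want is the product of one element from each marginal vector, for every combination of elements. The format should look like this: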

Color   Material  Country   Prob
Blue    Cotton     China    0.084  (= 0.2*0.7*0.6)
Blue    Cotton     USA      0.056  (= 0.2*0.7*0.4)
Blue    Silk       China    0.036  (= 0.2*0.3*0.6)
Blue    Silk       USA      ..
Green   Cotton     China    ..
Green   Cotton     USA      ..
...     ...        ...      ...

My implementation: here is my code:

joint.names = NULL  # stores the marginal value names (becomes a data frame)
joint.probs = NULL  # store probabilities.

for (i in iterations) {
    marginal = ProbGenerationProcess(VarUniqueToIteration) # output is numeric with names

    if ( is.null(joint.names) ) {
        # initialize the dataframes
        joint.names = names(marginal)
        joint.probs = marginal
    } else {
        # (my hope:) iteratively populate the joint.names and joint.probs

        joint.names = expand.grid(joint.names, names(marginal))

        expanded.prob = expand.grid(joint.probs, marginal)
        joint.probs = expanded.prob$Var1 * expanded.prob$Var2 # Row-by-row multiplication.
    }
}

Output: joint.probs is always correct, but joint.names does not behave the way I want. After the first two iterations everything looks fine. I get:

joint.names = 
    Var1  Var2
1   Blue  Cotton
2   Green Cotton
3   Blue  Silk
4   Green Silk 
    ...   ...

From the third iteration on, it goes wrong:

joint.names =
    Var1.Var1  Var1.Var2  Var1.Var1.1  Var1.Var2.1  Var2
1   Blue       Cotton     Blue         Cotton       China 
2   Green      Cotton     Green        Cotton       China
3   Blue       Silk       Blue         Silk         USA
4   Green      Silk       Green        Silk         USA
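
The third-iteration output is consistent with how expand.grid() sizes its arguments: it uses length(), and for a data frame that is the number of columns, not the number of rows. Passing the two-column joint.names therefore recycles its columns instead of crossing its rows with the new names. A minimal sketch that reproduces this, with the example values hard-coded in place of ProbGenerationProcess() output (the variable names names2 and bad are illustrative):

```r
# Result of the first two iterations: a 4-row, 2-column data frame of names.
names2 <- expand.grid(c("Blue", "Green"), c("Cotton", "Silk"),
                      stringsAsFactors = FALSE)

length(names2)  # 2: length() of a data frame counts columns
nrow(names2)    # 4: the number of name combinations so far

# Third iteration as written in the question: the data frame is treated as
# an argument of length 2, so the grid has 2 * 2 = 4 rows instead of 4 * 2 = 8.
bad <- expand.grid(names2, c("China", "USA"))
nrow(bad)       # 4, not the 8 combinations that are wanted
```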

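I guess my first question is: is this the most efficient way to get the result I want? If it is, and expand.grid() is the function I should be using, how should I initialize it correctly?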

Thanks for any help!

2 Answers:

Answer 0: (score: 2)

merge() is your friend.

color <- data.frame(color=c("blue","green"),prob=c(0.2,0.8))
material <- data.frame(material=c("cotton","silk"),prob=c(0.7,0.3))
country <- data.frame(country=c("china","usa"),prob=c(0.6,0.4))

dat <- merge(merge(color[1],material[1]),country[1]) # get names first

# same set of combinations as: expand.grid(c("china","usa"),c("cotton","silk"),c("blue","green"))

dat <- merge(dat, color, by="color")
dat <- merge(dat, material, by="material")
dat <- merge(dat, country, by="country")

dat$joint <- dat$prob.x * dat$prob.y * dat$prob # joint calc

dat <- dat[-grep("^prob",colnames(dat))] # cleanup extra probs

Result:

  country material color joint
1   china   cotton  blue 0.084
2   china   cotton green 0.336
3   china     silk  blue 0.036
4   china     silk green 0.144
5     usa   cotton  blue 0.056
6     usa   cotton green 0.224
7     usa     silk  blue 0.024
8     usa     silk green 0.096
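
To fit the iterative setting in the question, the same cross-join idea can be applied one marginal at a time. The sketch below assumes each iteration yields a named numeric vector. The marginals list stands in for successive ProbGenerationProcess() outputs, and the helper column p is illustrative. It relies on merge(x, y, by = NULL) returning the Cartesian product of the two data frames.

```r
# Stand-in for the vectors produced one per iteration (assumed values
# taken from the question's example).
marginals <- list(
  Color    = c(Blue = 0.2, Green = 0.8),
  Material = c(Cotton = 0.7, Silk = 0.3),
  Country  = c(China = 0.6, USA = 0.4)
)

joint <- NULL
for (nm in names(marginals)) {
  m <- marginals[[nm]]
  # One data frame per marginal: a name column and its probability column.
  step <- data.frame(names(m), unname(m), stringsAsFactors = FALSE)
  names(step) <- c(nm, "p")

  if (is.null(joint)) {
    # First iteration: the marginal itself is the running joint table.
    joint <- step
    names(joint)[2] <- "Prob"
  } else {
    # Cross join rows (not columns), then fold the new marginal into Prob.
    joint <- merge(joint, step, by = NULL)
    joint$Prob <- joint$Prob * joint$p
    joint$p <- NULL
  }
}

# Put the name columns first and the probability last.
joint <- joint[c(names(marginals), "Prob")]
joint
```

Because each marginal sums to 1, sum(joint$Prob) should also come out as 1 (up to floating-point error), which makes a convenient sanity check.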

Answer 1: (score: 1)

For simplicity, how about the following (although if performance is a concern, merge may be better):

PROBS<-data.frame(Item=rep(c("Color","Material","Country"),each=2),
           Value=c("Blue","Green","Cotton","Silk","China","USA"),
           Prob=c(0.2,0.8,0.7,0.3,0.6,0.4))

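# index PROBS by value so probabilities can be looked up by name
rownames(PROBS)<-PROBS$Value

# every combination of one Value per Item
GRID<-expand.grid(by(PROBS,PROBS$Item,function(x)x["Value"]))

# for each row, look up the probability of each value and multiply them
GRID$probs<-apply(GRID,1,function(x)prod(PROBS[c(x),"Prob"]))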

GRID
#  Color Country Material probs
#1  Blue   China   Cotton 0.084
#2 Green   China   Cotton 0.336
#3  Blue     USA   Cotton 0.056
#4 Green     USA   Cotton 0.224
#5  Blue   China     Silk 0.036
#6 Green   China     Silk 0.144
#7  Blue     USA     Silk 0.024
#8 Green     USA     Silk 0.096
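
If all the marginals can be collected in a list before building the table, a variation on this idea is to call expand.grid() once on the names and take the probabilities from a nested outer product. This is only a sketch under that assumption. The marginals list is illustrative and stands in for the vectors the question generates per iteration. It works because expand.grid() varies its first variable fastest, which matches the column-major order of the array that Reduce(outer, ...) builds.

```r
# Assumed input: one named numeric vector per variable.
marginals <- list(
  Color    = c(Blue = 0.2, Green = 0.8),
  Material = c(Cotton = 0.7, Silk = 0.3),
  Country  = c(China = 0.6, USA = 0.4)
)

# All name combinations; the first variable changes fastest.
grid <- expand.grid(lapply(marginals, names), stringsAsFactors = FALSE)

# outer(outer(Color, Material), Country) is a 2 x 2 x 2 array whose
# flattened order lines up with the rows of grid.
grid$Prob <- as.vector(Reduce(outer, marginals))

grid
```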