Fastest way to match strings against values contained in another data frame in R

Time: 2014-05-08 21:28:45

Tags: r list dataframe plyr

Sorry for the clunky title; I had trouble stating elegantly what I need to do. Here is some sample code:

a = c("12_36","13_47","10_55")
b = c("15_47")
c = NULL
d = c("Trader1", "Trader2", "Trader3","Trader4")
Profits = data.frame(Traders = d, Value = I(list(a,b,b,c)), 
                     Cost = I(list(b,a,c,a)), 
              Date = as.Date(c("2011-08-01",
                               "2011-08-02","2011-08-03","2011-08-04")))
Reference = data.frame(Index = rep(c(a,b), 4), 
                       MktPrice = c(1,4,5,6,
                                    2,3.5,7.0,8.574,
                                    9.2345,1.689,0.567,4.5362,
                                    2.35,7.66673,7.88893,6.1221),
                       Date = as.Date(c("2011-08-01","2011-08-01",
                                        "2011-08-01","2011-08-01",
                                        "2011-08-02","2011-08-02",
                                        "2011-08-02","2011-08-02",
                                        "2011-08-03","2011-08-03",
                                        "2011-08-03","2011-08-03",
                                        "2011-08-04","2011-08-04",
                                        "2011-08-04","2011-08-04")))

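This creates two data frames. The first, Profits, has four columns. The first column holds the names of traders in a virtual market. The second and third columns hold, for each trader, a character vector of the items they received or traded. These strings correspond to the Index values in Reference, which holds the "market price" of each item for each day. The last column of Profits is the date of the trade.

What I want to do now is take every item in the Value and Cost columns of Profits, find the corresponding market price for each item, subtract the price of the Value items from the price of the Cost items, and store that total as a fifth column of Profits.

So what is the best way to do this? I imagine it will be some kind of nested function that walks through Value and Cost and matches against Reference, but I am not sure which (plyr?). Speed also matters, because the real data frames are both very large. Thanks in advance!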

1 answer:

Answer 0: (score: 1)

So I modified the sample to use NA instead of NULL:

a = c("12_36","13_47","10_55")
b = c("15_47")
c = NA
d = c("Trader1", "Trader2", "Trader3","Trader4")
Profits = data.frame(
    Traders = d, Value = I(list(a,b,b,c)), 
    Cost = I(list(b,a,c,a)), 
    Date = as.Date(c("2011-08-01",
        "2011-08-02","2011-08-03","2011-08-04"))
)
Reference = data.frame(
    Index = rep(c(a,b), 4), 
    MktPrice = c(1,4,5,6,
    2,3.5,7.0,8.574,
    9.2345,1.689,0.567,4.5362,
    2.35,7.66673,7.88893,6.1221),
    Date = as.Date(c("2011-08-01","2011-08-01",
    "2011-08-01","2011-08-01","2011-08-02",
    "2011-08-02","2011-08-02","2011-08-02",
    "2011-08-03","2011-08-03","2011-08-03",
    "2011-08-03","2011-08-04","2011-08-04",
    "2011-08-04","2011-08-04"))
)

Then I denormalized Profits, so that each item gets its own row:

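# Expand each row of Profits into one row per item.
# data.frame() recycles the shorter of Value/Cost to the length of the longer.
dProfits <- do.call(rbind, lapply(seq.int(nrow(Profits)), function(i) {
    data.frame(Traders = Profits[i,1],
        Value = Profits[i,2][[1]],
        Cost = Profits[i,3][[1]],
        Date = Profits[i,4],
        stringsAsFactors = FALSE)
}))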

Then I used a standard merge-type procedure:

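# Look up the market price of each Value item (inner join: rows with an NA Value are dropped)
mm <- merge(dProfits, Reference,
    by.x=c("Value","Date"), by.y=c("Index","Date"))
# Look up the market price of each Cost item, keeping rows with no match
mm <- merge(mm, Reference, suffixes=c("",".Cost"),
    all.x=TRUE, by.x=c("Cost","Date"), by.y=c("Index","Date"))
# Per-row difference: price of the Value item minus price of the Cost item
mm <- transform(mm, diff = MktPrice - MktPrice.Cost)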
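The merged result has one row per item pair rather than one row per trader, while the question asked for a fifth column on Profits. Here is a minimal base-R sketch of one way to get a per-trader total. It is not part of the original answer. It assumes that an NA entry means "no items" and contributes 0, and that profit is Value minus Cost, following the sign of diff above (swap the operands for the opposite convention). The names price, total_price, value_total and cost_total are made up for illustration.

```r
a <- c("12_36", "13_47", "10_55")
b <- c("15_47")
days <- as.Date(c("2011-08-01", "2011-08-02", "2011-08-03", "2011-08-04"))

Profits <- data.frame(
    Traders = c("Trader1", "Trader2", "Trader3", "Trader4"),
    Value = I(list(a, b, b, NA)),
    Cost = I(list(b, a, NA, a)),
    Date = days
)
Reference <- data.frame(
    Index = rep(c(a, b), 4),
    MktPrice = c(1, 4, 5, 6,
                 2, 3.5, 7.0, 8.574,
                 9.2345, 1.689, 0.567, 4.5362,
                 2.35, 7.66673, 7.88893, 6.1221),
    Date = rep(days, each = 4)
)

# Named price vector keyed by "item|date" (assumes each item/date pair is unique)
price <- setNames(
    Reference$MktPrice,
    paste(as.character(Reference$Index), format(Reference$Date), sep = "|")
)

# Sum the prices of a trader's items on one date.
# Assumption: NA means "no items" and counts as 0.
# An item missing from Reference would make the sum NA.
total_price <- function(items, date) {
    items <- items[!is.na(items)]
    if (length(items) == 0) return(0)
    sum(price[paste(items, date, sep = "|")])
}

dates <- format(Profits$Date)
rows <- seq_len(nrow(Profits))
value_total <- vapply(rows, function(i) total_price(Profits$Value[[i]], dates[i]), numeric(1))
cost_total <- vapply(rows, function(i) total_price(Profits$Cost[[i]], dates[i]), numeric(1))

# Fifth column; Trader1 on 2011-08-01 gives (1 + 4 + 5) - 6 = 4
Profits$Profit <- value_total - cost_total
```

Summing Value and Cost separately means an item is never counted more than once, even when a trader's Value and Cost vectors have different lengths.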

You will have to see how this performs on your data. You may get better merge performance with data.table than with a standard data.frame.
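As a rough illustration of that suggestion, the same two merges can be written with the data.table package. This is a sketch, not a benchmark. It assumes data.table is installed, and it uses a small hand-built stand-in for the denormalized table rather than the full sample.

```r
library(data.table)  # third-party package; must be installed

# Small stand-in for the denormalized profits table (illustrative data)
dProfits <- data.table(
    Traders = c("Trader1", "Trader1", "Trader2"),
    Value = c("12_36", "13_47", "15_47"),
    Cost = c("15_47", "15_47", "12_36"),
    Date = as.Date(c("2011-08-01", "2011-08-01", "2011-08-02"))
)
Reference <- data.table(
    Index = rep(c("12_36", "13_47", "15_47"), 2),
    MktPrice = c(1, 4, 6, 2, 3.5, 8.574),
    Date = rep(as.Date(c("2011-08-01", "2011-08-02")), each = 3)
)

# Same two merges as above, dispatched to merge.data.table
mm <- merge(dProfits, Reference,
    by.x = c("Value", "Date"), by.y = c("Index", "Date"))
mm <- merge(mm, Reference, suffixes = c("", ".Cost"),
    all.x = TRUE, by.x = c("Cost", "Date"), by.y = c("Index", "Date"))

# Add the difference column by reference, without copying the table
mm[, diff := MktPrice - MktPrice.Cost]
```

Whether this is actually faster depends on the size and shape of the real data, so it is worth timing both versions.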