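I have buy and sell transactions in long format, and I want to convert them to wide format. Take a look at the example:
For every BUY transaction on a given ticker there must be a SELL transaction on the same ticker that closes the position. If no SELL transaction exists, or its remaining number of shares has dropped to zero, put NA in the sell price.
Explanation:
We bought 100 shares of AIG at a price of 34.56. Next, we must find the exit (SELL) transaction for the same ticker, AIG. That transaction exists further down, for 600 shares. So we close the AIG BUY transaction with 100 shares, reduce the SELL transaction from 600 shares to 500, and write this trade in wide format with both the buy price and the sell price.
The next transaction is GOOG. For this ticker we find two SELL transactions and write them out in wide format, but 100 shares remain unsold, so we record that part of the trade as "not completed", with NA as the sell price.
If necessary, I can put the algorithm into pseudocode later, but I hope my explanation is clear.
My question is: is there an easy way to do this in R with clean, vectorized code? The algorithm is easy to write in an imperative language such as C++, but I am having trouble with it in R.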
Edit 1: Added input and output data frames for R:
inputDF1 <- data.frame(Ticker = c("AIG", "GOOG", rep("AIG", 3), rep("GOOG", 2), rep("NEM", 3)), Side = c(rep("BUY", 4), rep("SELL", 3), "BUY", rep("SELL", 2)), Shares = c(100, 400, 200, 400, 600, 200, 100, 100, 50, 50), Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461, 45, 56, 78))
inputDF2 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 200, 300, 200, 100), Price = c(34, 35, 36, 457, 458, 459))
inputDF3 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 100, 300, 100, 100), Price = c(34, 35, 36, 457, 458, 459))
outputDF1 <- data.frame(Ticker = c("AIG", rep("GOOG", 3), rep("AIG", 3), rep("NEM", 2)), Side = rep("BUY", 9), Shares = c(100, 200, 100, 100, 200, 300, 100, 50, 50), BuyPrice = c(34.56, 457, 457, 457, 28.56, 24.65, 24.65, 45, 45), SellPrice = c(30.02, 460, 461, NA, 30.02, 30.02, NA, 56, 78))
outputDF2 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 2)), Side = rep("BUY", 4), Shares = c(100, 100, 200, 100), BuyPrice = c(34, 35, 457, 457), SellPrice = c(36, 36, 458, 459))
outputDF3 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 3)), Side = rep("BUY", 5), Shares = rep(100, 5), BuyPrice = c(34, 35, rep(457, 3)), SellPrice = c(36, NA, 458, 459, NA))
Edit 2: Updated the example and the input/output data for R.
Answer 0 (score: 3)
Using dcast from reshape2:
> t <- c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG")
> sd <- c(rep("BUY", 4), rep("SELL", 3))
> sh <- c(100, 400, 200, 400, 600, 200, 100)
> pr <- c(34.56, 457, 28.56, 24.65, 30.02, 460, 461)
> df <- data.frame(Ticker = t, Side = sd, Shares = sh, Price = pr)
>
> library(reshape2)
> df
Ticker Side Shares Price
1 AIG BUY 100 34.56
2 GOOG BUY 400 457.00
3 AIG BUY 200 28.56
4 AIG BUY 400 24.65
5 AIG SELL 600 30.02
6 GOOG SELL 200 460.00
7 GOOG SELL 100 461.00
> dcast(df, Ticker*Shares ~ Side, value.var="Price")
Ticker Shares BUY SELL
1 AIG 100 34.56 NA
2 AIG 200 28.56 NA
3 AIG 400 24.65 NA
4 AIG 600 NA 30.02
5 GOOG 100 NA 461.00
6 GOOG 200 NA 460.00
7 GOOG 400 457.00 NA
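The key point here is that "vectorized" in R usually goes hand in hand with "functional" (for example, the apply() family), but a purely functional approach does not fit well here, because the list of sells has to be updated after every (partial) buy transaction. I genuinely think you could do something magical with aggregate or by and a carefully designed function, but the most readable solution involves a simple for loop.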
inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"),
                      Side = c(rep("BUY", 4), rep("SELL", 3)),
                      Shares = c(100, 400, 200, 400, 600, 200, 100),
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF, Side == "BUY")
sells <- subset(inputDF, Side == "SELL")
transactions <- NULL
# go through every buy operation
for (i in seq_len(nrow(buys))) {
  ticker <- buys[i, "Ticker"]
  bp <- buys[i, "Price"]
  shares <- buys[i, "Shares"]
  # keep going as long as we can find sellers
  while (shares > 0 && sum(sells[sells$Ticker == ticker, "Shares"]) > 0) {
    # position of the first sell for this ticker that still has shares left
    j <- which(sells$Ticker == ticker & sells$Shares > 0)[1]
    sp <- sells[j, "Price"]
    shares.sold <- min(shares, sells[j, "Shares"])
    shares <- shares - shares.sold
    # update the same sell row that was just used
    sells[j, "Shares"] <- sells[j, "Shares"] - shares.sold
    transactions <- rbind(transactions,
                          data.frame(Ticker = ticker, Side = "BUY",
                                     Shares = shares.sold,
                                     BuyPrice = bp, SellPrice = sp))
  }
  # not enough sellers: record the remainder with a missing sell price
  # (a real NA, not the string "NA", so that SellPrice stays numeric)
  if (shares > 0) {
    transactions <- rbind(transactions,
                          data.frame(Ticker = ticker, Side = "BUY",
                                     Shares = shares,
                                     BuyPrice = bp, SellPrice = NA_real_))
  }
}
print(transactions)
Output:
  Ticker Side Shares BuyPrice SellPrice
1    AIG  BUY    100    34.56     30.02
2   GOOG  BUY    200   457.00    460.00
3   GOOG  BUY    100   457.00    461.00
4   GOOG  BUY    100   457.00        NA
5    AIG  BUY    200    28.56     30.02
6    AIG  BUY    300    24.65     30.02
7    AIG  BUY    100    24.65        NA
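If we tried to use the foreach package to parallelize the loop automatically, the update problem would become obvious: we would quickly find that we have a race condition on the sells data frame.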
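The sequential update of the sell list is what stands in the way of vectorizing or parallelizing the loop. One way around it, offered as a sketch and not as part of the original answer, is to treat every trade as an interval of cumulative shares within its ticker: the shares of buy i closed by sell j are then simply the overlap of the two intervals, which can be computed without mutating anything. The helper names match_ticker and match_trades and the temporary column row are hypothetical, and the sketch assumes, like the loop, that any sell of the same ticker may close any buy regardless of where it appears in the table.

```r
# Hypothetical helper: FIFO-match the buys and sells of ONE ticker.
# Expects columns Ticker, Side, Shares, Price and an index column `row`.
match_ticker <- function(d) {
  b <- d[d$Side == "BUY", ]
  s <- d[d$Side == "SELL", ]
  # Each trade covers a half-open interval of cumulative shares.
  b_end <- cumsum(b$Shares); b_start <- b_end - b$Shares
  s_end <- cumsum(s$Shares); s_start <- s_end - s$Shares
  # All (buy i, sell j) pairs; j varies fastest, so rows stay in buy order.
  p <- expand.grid(j = seq_len(nrow(s)), i = seq_len(nrow(b)))
  # Overlap of the two intervals = shares of buy i closed by sell j.
  fill <- pmin(b_end[p$i], s_end[p$j]) - pmax(b_start[p$i], s_start[p$j])
  keep <- fill > 0
  matched <- data.frame(row = b$row[p$i[keep]], Ticker = b$Ticker[p$i[keep]],
                        Shares = fill[keep], BuyPrice = b$Price[p$i[keep]],
                        SellPrice = s$Price[p$j[keep]])
  # Part of each buy that lies beyond the total number of shares sold.
  left <- pmax(b_end - pmax(b_start, sum(s$Shares)), 0)
  open <- left > 0
  unfilled <- data.frame(row = b$row[open], Ticker = b$Ticker[open],
                         Shares = left[open], BuyPrice = b$Price[open],
                         SellPrice = rep(NA_real_, sum(open)))
  rbind(matched, unfilled)
}

# Hypothetical wrapper: apply the matching per ticker, restore buy order.
match_trades <- function(df) {
  df$row <- seq_len(nrow(df))
  out <- do.call(rbind, lapply(split(df, df$Ticker), match_ticker))
  out <- out[order(out$row), ]  # ties keep their order: fills, then remainder
  rownames(out) <- NULL
  data.frame(Ticker = out$Ticker, Side = rep("BUY", nrow(out)),
             out[c("Shares", "BuyPrice", "SellPrice")])
}

trades <- data.frame(Ticker = c("AIG", "AIG", "AIG", "GOOG", "GOOG", "GOOG"),
                     Side = c("BUY", "BUY", "SELL", "BUY", "SELL", "SELL"),
                     Shares = c(100, 100, 100, 300, 100, 100),
                     Price = c(34, 35, 36, 457, 458, 459),
                     stringsAsFactors = FALSE)
result <- match_trades(trades)
print(result)
```

The pairwise table grows with the number of buys times the number of sells per ticker, so this trades memory for the absence of a loop; for tickers with very many trades the plain loop may still be the more practical choice.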
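There are some inefficiencies in the code above that could be improved. Appending via rbind() is not very efficient and could be optimized somewhat, either by reducing the number of calls to rbind() or by eliminating them altogether. You could also wrap everything in a function and turn the loop into a call to apply(); even a sequential apply() can be faster, because the loop is done at a more optimized level. (The same holds for CPython: list comprehensions and str.join() are much faster than for loops, both because they "know more" about the total size of the operation and because they are written in optimized C.) Here is a first attempt. Note that we use do.call(rbind, list(...)) to simplify the list of small data frames returned by the call to apply(). This is not very efficient (rbindlist from data.table is significantly faster, see here), but it has no external dependencies. The list you get back from apply() is actually interesting in its own right: each element is the list of transactions needed to complete an entire buy operation. If you add row names to the buys data frame, you can call up each group of transactions by name.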
inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"),
                      Side = c(rep("BUY", 4), rep("SELL", 3)),
                      Shares = c(100, 400, 200, 400, 600, 200, 100),
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF, Side == "BUY")
sells <- subset(inputDF, Side == "SELL")
transactions <- NULL
# go through every buy operation
buy.operation <- function(x) {
  ticker <- x["Ticker"]
  # apply() implicitly converts to a matrix, and all the elements of a matrix
  # have the same data type, so everything gets converted to character;
  # thus, we need to convert back
  bp <- as.numeric(x["Price"])
  shares <- as.numeric(x["Shares"])
  # keep going as long as we can find sellers
  while (shares > 0 && sum(sells[sells$Ticker == ticker, "Shares"]) > 0) {
    # position of the first sell for this ticker that still has shares left
    j <- which(sells$Ticker == ticker & sells$Shares > 0)[1]
    sp <- sells[j, "Price"]
    shares.sold <- min(shares, sells[j, "Shares"])
    shares <- shares - shares.sold
    sells[j, "Shares"] <- sells[j, "Shares"] - shares.sold
    transactions <- rbind(transactions,
                          data.frame(Ticker = ticker, Side = "BUY",
                                     Shares = shares.sold,
                                     BuyPrice = bp, SellPrice = sp))
  }
  # not enough sellers: record the remainder with a missing sell price
  if (shares > 0) {
    transactions <- rbind(transactions,
                          data.frame(Ticker = ticker, Side = "BUY",
                                     Shares = shares,
                                     BuyPrice = bp, SellPrice = NA_real_))
  }
  transactions
}
transactions <- do.call(rbind, apply(buys, 1, buy.operation))
# get rid of weird row names
row.names(transactions) <- NULL
print(transactions)
Unfortunately, the last incomplete AIG transaction is lost. I haven't figured out how to fix this yet.
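One plausible explanation for the lost transaction: an assignment to sells inside buy.operation changes only a copy that is local to that call, so every buy starts again from the original, untouched sell table, and the final AIG buy still sees all 600 shares on offer. A minimal sketch of a workaround, assuming that is the cause, writes the update back with `<<-` and iterates over row indices with lapply(), which also avoids the character conversion done by apply(). The pieces list is a hypothetical addition that collects the rows and binds them once per buy.

```r
inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"),
                      Side = c(rep("BUY", 4), rep("SELL", 3)),
                      Shares = c(100, 400, 200, 400, 600, 200, 100),
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461),
                      stringsAsFactors = FALSE)
buys  <- subset(inputDF, Side == "BUY")
sells <- subset(inputDF, Side == "SELL")

# Process the i-th buy; returns one small data frame per buy.
buy.operation <- function(i) {
  ticker <- buys$Ticker[i]
  bp     <- buys$Price[i]
  shares <- buys$Shares[i]
  pieces <- list()
  repeat {
    # first sell for this ticker that still has shares left (NA if none)
    j <- which(sells$Ticker == ticker & sells$Shares > 0)[1]
    if (shares <= 0 || is.na(j)) break
    sold <- min(shares, sells$Shares[j])
    pieces[[length(pieces) + 1]] <- data.frame(
      Ticker = ticker, Side = "BUY", Shares = sold,
      BuyPrice = bp, SellPrice = sells$Price[j])
    shares <- shares - sold
    # `<<-` updates the shared sells table instead of a local copy
    sells$Shares[j] <<- sells$Shares[j] - sold
  }
  # not enough sellers: keep the remainder with a missing sell price
  if (shares > 0) {
    pieces[[length(pieces) + 1]] <- data.frame(
      Ticker = ticker, Side = "BUY", Shares = shares,
      BuyPrice = bp, SellPrice = NA_real_)
  }
  do.call(rbind, pieces)
}

transactions <- do.call(rbind, lapply(seq_len(nrow(buys)), buy.operation))
print(transactions)
```

Because each call now depends on the state left behind by the previous one, this is a loop in functional clothing: the calls must run in order and cannot be parallelized.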
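Answer 1 (score: 2)
(This answer uses data.table.)
Since you haven't mentioned anything about the dimensions of your actual data, I can't optimize this any further. It would be great if you could run it on your real data set and write back with your findings (regarding speed and scaling).
First, we split the data set by Side and perform a join. This is the most straightforward way. I see that @Mike.Gahan has tried this route as well.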
require(data.table)
dt1 <- as.data.table(inputDF1)
d1 <- dt1[Side == "BUY"][, N := .N > 1L, by=Ticker]
d2 <- dt1[Side == "SELL"]
setkey(d2, Ticker)
ans = d2[d1, allow.cartesian=TRUE][, Side := NULL]
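Note that allow.cartesian does not perform a Cartesian join. The term is used quite loosely here. Please read ?data.table for details, or see this post for more information. Basically, the join will be really fast and scales very well. It is not the limiting step.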
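To see what the join step produces, here is a small base-R analogue using merge(); it is not the data.table code from this answer, and the tiny buys and sells tables are made up for illustration. Within each ticker, every BUY row is paired with every SELL row, so the result has one row per (buy, sell) combination per ticker and never combines rows across tickers.

```r
# Made-up illustration data: two AIG buys, one GOOG buy
buys <- data.frame(Ticker = c("AIG", "AIG", "GOOG"),
                   Shares = c(100, 200, 400),
                   Price = c(34.56, 28.56, 457),
                   stringsAsFactors = FALSE)
# one AIG sell, two GOOG sells
sells <- data.frame(Ticker = c("AIG", "GOOG", "GOOG"),
                    Shares = c(300, 200, 100),
                    Price = c(30.02, 460, 461),
                    stringsAsFactors = FALSE)

# Join on Ticker only; clashing column names get the given suffixes
pairs <- merge(buys, sells, by = "Ticker", suffixes = c(".buy", ".sell"))
print(pairs)
# AIG: 2 buys x 1 sell = 2 rows; GOOG: 1 buy x 2 sells = 2 rows
```

The steps that follow in the answer then reduce these pairs to the actual fills.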
We now set the column order and names accordingly:
setcolorder(ans, c("Ticker", "Side.1", "Shares.1", "Shares", "Price.1", "Price", "N"))
setnames(ans, c("Ticker", "Side", "Shares", "tmp", "BuyPrice", "SellPrice", "N"))
We swap Shares and tmp so that Shares reflects the actual output we expect, depending on the value of N, as follows:
ans[, c("Shares", "tmp") := if (!N[1L])
{ val = Shares[1L]; list(tmp, val) }, by = Ticker]
We need a few more helper columns to aggregate and get the final result:
ans[, `:=`(N2= rep(c(FALSE, TRUE), c(.N-1L, 1L)),
csum = sum(Shares)), by = Ticker][, N2 := !(N2 * (csum != tmp))]
Finally,
ans1 = ans[(N2)][, c("N", "N2", "tmp", "csum") := NULL]
ans2 = ans[!(N2)][, N := N * 1L]
if (nrow(ans2) > 0) {
ans2 = ans2[, list("BUY", if (N[1L]) c(Shares+tmp-csum, csum-tmp)
else c(Shares, tmp-csum), BuyPrice, c(SellPrice, NA)), by=Ticker]
}
ans = rbindlist(list(ans1, ans2))
# Ticker Side Shares BuyPrice SellPrice
# 1: AIG BUY 100 34.56 30.02
# 2: GOOG BUY 200 457.00 460.00
# 3: AIG BUY 200 28.56 30.02
# 4: NEM BUY 50 45.00 56.00
# 5: NEM BUY 50 45.00 78.00
# 6: GOOG BUY 100 457.00 461.00
# 7: GOOG BUY 100 457.00 NA
# 8: AIG BUY 300 24.65 30.02
# 9: AIG BUY 100 24.65 NA
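My guess is that this should be quite fast. However, there may be room to optimize it further. If you choose to build on this answer, I'll leave that to you.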