How do I subset a data.table based on another data.table?

Time: 2016-05-12 20:15:42

Tags: r data.table

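I'm trying to learn how to use data.tables. It's not going well.

I have one big data.table containing a bunch of returns and AUM. I split it into two data.tables, one with the returns and one with the AUM. I now want to subset the returns data.table so that I get only the returns of funds whose AUM is below the 50th percentile.

To give you an idea, here is my code: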

fundDetails <- data.table(read.table("Fund_Deets.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))
fundNAV <- data.table(read.table("NAV_AUM.csv", sep = ",", fill = TRUE, quote="\"", header=TRUE))

allFundDetails <- fundDetails[Currency == 'USD']
allFundNAV <- fundNAV[Fund.ID %in% allFundDetails$Fund.ID]
allFundAUM <- allFundNAV[Type == 'AUM', -c(1,3), with = FALSE]
allFundAUM <- setnames(data.table(t(sapply(allFundAUM[,-1, with = FALSE],as.numeric))), as.character(allFundAUM$Fund.ID))
allFundReturns <- allFundNAV[Type == 'Return', -c(1,3), with = FALSE]
allFundReturns <- setnames(data.table(t(sapply(allFundReturns[,-1, with = FALSE],as.numeric)/100)), as.character(allFundReturns$Fund.ID))
smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))

This produces the following three tables (smallFundReturns is obviously just NAs):

> allFundAUM[,1:10, with = FALSE]
     33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
  1:    NA    NA    NA    NA    NA   NA    NA    NA     1    27
  2:    NA    NA    NA    NA    NA   NA   117    NA     1    27
  3:    NA    NA    NA    NA    NA   NA   120    NA     1    27
  4:    NA    NA    NA    NA    NA   NA   133    NA     1    27
  5:    NA    NA    NA    NA    NA   NA   146    NA     1    29
 ---                                                           
260:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
261:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
262:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
263:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
264:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
> allFundReturns[,1:10, with = FALSE]
     33992 33261 38102 33264 33275 5606   41695 40483   41526   45993
  1:    NA    NA    NA    NA    NA   NA      NA    NA  0.0188 -0.0116
  2:    NA    NA    NA    NA    NA   NA -0.0315    NA -0.0120  0.0134
  3:    NA    NA    NA    NA    NA   NA -0.0978    NA -0.0908 -0.0206
  4:    NA    NA    NA    NA    NA   NA -0.0445    NA -0.0269 -0.0287
  5:    NA    NA    NA    NA    NA   NA  0.0139    NA  0.0298 -0.0141
 ---                                                                 
260:    NA    NA    NA    NA    NA   NA      NA    NA      NA      NA
261:    NA    NA    NA    NA    NA   NA      NA    NA      NA      NA
262:    NA    NA    NA    NA    NA   NA      NA    NA      NA      NA
263:    NA    NA    NA    NA    NA   NA      NA    NA      NA      NA
264:    NA    NA    NA    NA    NA   NA      NA    NA      NA      NA
> smallFundReturns[,1:10, with = FALSE]
     33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
  1:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  2:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  3:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  4:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  5:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
 ---                                                           
260:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
261:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
262:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
263:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
264:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA

I'm trying to do the subsetting with a for loop (I'm using a for loop so that I can debug it):

for (i in 1:nrow(allFundReturns)){
  theSubset <- as.vector(allFundReturns[i,] <= as.numeric(quantile(allFundAUM[i,], .5, na.rm = TRUE)))
  theSubset[is.na(theSubset)] <- FALSE
  theSubset <- colnames(allFundReturns)[theSubset]
  smallFundReturns[i,theSubset, with = FALSE] = allFundReturns[i,theSubset, with = FALSE]
}

This produces an error:

Error in `[<-.data.table`(`*tmp*`, i, theSubset, with = FALSE, value = list( : 
  unused argument (with = FALSE)

I tried removing the 'with' part, but that gives a bunch of warnings:

> warnings()
Warning messages:
1: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526",  ... :
  Supplied 3020 items to be assigned to 1 items of column '41526' (3019 unused)
2: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526",  ... :
  Supplied 3020 items to be assigned to 1 items of column '45993' (3019 unused)
3: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526",  ... :
  Supplied 3020 items to be assigned to 1 items of column '45994' (3019 unused)
4: In `[<-.data.table`(`*tmp*`, i, theSubset, value = c("41526",  ... :

I'm confused about how to do this. Any ideas on how I can subset the second data.table using a subset of the first one?

Edit:

I tried the following suggestion:

smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]

and got these warnings():

> warnings()
Warning messages:
1: In `[.data.table`(smallFundReturns, i, `:=`((theSubset),  ... :
  Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
2: In `[.data.table`(smallFundReturns, i, `:=`((theSubset),  ... :
  Coerced 'double' RHS to 'logical' to match the column's type; may have truncated precision. Either change the target column to 'double' first (by creating a new 'double' vector length 264 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'logical' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
3: In `[.data.table`(smallFundReturns, i, `:=`((theSubset),  ... :

The code produced this, with TRUE everywhere I was expecting a number:

> smallFundReturns[,1:10, with = FALSE]
     33992 33261 38102 33264 33275 5606 41695 40483 41526 45993
  1:    NA    NA    NA    NA    NA   NA    NA    NA  TRUE  TRUE
  2:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  3:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  4:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
  5:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
 ---                                                           
260:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
261:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
262:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
263:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA
264:    NA    NA    NA    NA    NA   NA    NA    NA    NA    NA

Edit 2:

I figured out the problem. Apparently this line:

smallFundReturns <- data.table(sapply(allFundReturns, function(x) rep(NA, length(x))))

creates the table with logical columns. I changed it to this line:

smallFundReturns <- data.table(sapply(allFundReturns, function(x) as.numeric(rep(NA, length(x)))))

With that change, and @HubertL's fix, everything works. Thanks!!
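
The type issue described in Edit 2 can be reproduced on a tiny table. The sketch below uses made-up values rather than the real fund data, and the names returns, logicalTemplate and numericTemplate are only illustrative. It shows that a bare NA is a logical value in R, while NA_real_ yields double columns that can hold returns without coercion.

```r
library(data.table)

# Hypothetical stand-in for allFundReturns (values are made up)
returns <- data.table(`41526` = c(0.0188, -0.0120), `45993` = c(-0.0116, 0.0134))

# rep(NA, n) is a logical vector, so every column comes out logical
logicalTemplate <- returns[, lapply(.SD, function(x) rep(NA, length(x)))]

# NA_real_ is the double-typed missing value, so every column comes out numeric
numericTemplate <- returns[, lapply(.SD, function(x) rep(NA_real_, length(x)))]

print(sapply(logicalTemplate, class))
print(sapply(numericTemplate, class))
```

Using lapply over .SD also keeps the result a data.table, so the round trip through sapply and a matrix is not needed for this step.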

2 answers:

Answer 0: (score: 1)

You have to write it like this:

smallFundReturns[i,(theSubset):=allFundReturns[i,(theSubset), with = FALSE], with = FALSE]
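
For context: the original error comes from the replacement form (the = after the closing bracket), which dispatches to [<-.data.table, and that method has no with argument. The := operator inside the brackets assigns by reference instead. Here is a minimal sketch on made-up data; src, dst, rowIdx and cols are hypothetical names, not the asker's objects. The outer with = FALSE is left out because wrapping the left-hand side in parentheses already marks it as a vector of column names, and as far as I know recent data.table releases deprecate combining with = FALSE with :=.

```r
library(data.table)

# Made-up source and target tables; the target is created as double up front
src <- data.table(a = c(0.1, 0.2), b = c(0.3, 0.4), c = c(0.5, 0.6))
dst <- data.table(a = rep(NA_real_, 2), b = rep(NA_real_, 2), c = rep(NA_real_, 2))

rowIdx <- 1L          # hypothetical row to fill
cols <- c("a", "c")   # hypothetical columns selected for that row

# (cols) on the left of := means "the columns named in cols";
# the guard skips rows where nothing was selected
if (length(cols) > 0) {
  dst[rowIdx, (cols) := src[rowIdx, cols, with = FALSE]]
}

print(dst)
```

Only row 1 of columns a and c is filled; everything else stays NA, which is the behaviour the loop in the question is after.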

Answer 1: (score: 1)

Suggestions for improvement:

Try reading the data with fread instead of read.table. It is much faster, and the result is a data.table rather than a data.frame.
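
A minimal sketch of the fread suggestion, using a small made-up CSV written to a temporary file so that it runs on its own. The file contents and the "Fund ID" header are assumptions, not the asker's data. One thing to watch for: by default fread keeps column names as they appear in the file, whereas read.table turns them into syntactic names, so a header like "Fund ID" would not become Fund.ID unless check.names = TRUE is passed.

```r
library(data.table)

# Write a small made-up CSV to a temporary file so the example is self-contained
csvPath <- tempfile(fileext = ".csv")
fwrite(data.table(`Fund ID` = c(33992L, 41526L), Currency = c("USD", "EUR")), csvPath)

# fread returns a data.table directly, so no data.table(read.table(...)) wrapper is needed
fundDetails <- fread(csvPath)

# Filtering works as in the question
usdFunds <- fundDetails[Currency == "USD"]
print(usdFunds)
```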

Do "data.table operations". By specifying ", with = FALSE" you are actually forcing R to use the slower data.frame operations instead of the super-fast data.table methods.
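
One loop-free way to get the same result is sketched below on made-up data. It assumes the intent is to keep a fund's return in a given period only when that fund's AUM is at or below the median AUM of that period; the loop in the question compares the returns themselves against the AUM quantile, which may not be what was meant. All names here (aumDT, retDT, keep, smallRet) are hypothetical.

```r
library(data.table)

# Made-up stand-ins: rows are periods, columns are funds
aumDT <- data.table(f1 = c(10, 20), f2 = c(30, NA), f3 = c(50, 5))
retDT <- data.table(f1 = c(0.01, 0.02), f2 = c(0.03, 0.04), f3 = c(0.05, 0.06))

aum <- as.matrix(aumDT)
ret <- as.matrix(retDT)

# Median AUM per row (period), ignoring missing values
rowMedian <- apply(aum, 1, median, na.rm = TRUE)

# TRUE where a fund's AUM is known and at or below that period's median;
# rowMedian is recycled down each column, so it lines up with the rows
keep <- !is.na(aum) & aum <= rowMedian

# Blank out the returns of funds that do not qualify
ret[!keep] <- NA_real_
smallRet <- as.data.table(ret)
print(smallRet)
```

Working on matrices fits here because both tables are purely numeric and share the same shape, so one logical mask replaces the per-row assignment.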

Have fun