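I have a data.frame of length 100000. I want to count the positive and negative values in subsets of it, where each subset is a fraction of the full length (levels between 0.01 and 0.99).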
> dput(sumDF[1:100])
structure(c(3000, 2000, 5000, 4000, 1000, 4000, 0, 3000, 4000,
2000, 2000, 3000, 1000, -3000, 2000, 0, 4000, 1000, 1000, 2000,
2000, 2000, 2000, 1000, 3000, 1000, 4000, 3000, 2000, 3000, 1000,
1000, 4000, 2000, 0, 1000, 2000, 5000, 3000, 3000, 0, 2000, 2000,
3000, 1000, -1000, 2000, 1000, 2000, 3000, 2000, 3000, 2000,
2000, 2000, 2000, 3000, 3000, 3000, 2000, 3000, 3000, 1000, 3000,
1000, 2000, 1000, -1000, 0, 2000, 2000, 3000, 0, 3000, 2000,
2000, 5000, 3000, 2000, 1000, 3000, 3000, 4000, 1000, 2000, 2000,
3000, 0, 3000, 1000, 0, 4000, 4000, 2000, 3000, 0, 2000, 4000,
0, 0), .Names = c("modelOutcome1", "modelOutcome2", "modelOutcome3",
"modelOutcome4", "modelOutcome5", "modelOutcome6", "modelOutcome7",
"modelOutcome8", "modelOutcome9", "modelOutcome10", "modelOutcome11",
"modelOutcome12", "modelOutcome13", "modelOutcome14", "modelOutcome15",
"modelOutcome16", "modelOutcome17", "modelOutcome18", "modelOutcome19",
"modelOutcome20", "modelOutcome21", "modelOutcome22", "modelOutcome23",
"modelOutcome24", "modelOutcome25", "modelOutcome26", "modelOutcome27",
"modelOutcome28", "modelOutcome29", "modelOutcome30", "modelOutcome31",
"modelOutcome32", "modelOutcome33", "modelOutcome34", "modelOutcome35",
"modelOutcome36", "modelOutcome37", "modelOutcome38", "modelOutcome39",
"modelOutcome40", "modelOutcome41", "modelOutcome42", "modelOutcome43",
"modelOutcome44", "modelOutcome45", "modelOutcome46", "modelOutcome47",
"modelOutcome48", "modelOutcome49", "modelOutcome50", "modelOutcome51",
"modelOutcome52", "modelOutcome53", "modelOutcome54", "modelOutcome55",
"modelOutcome56", "modelOutcome57", "modelOutcome58", "modelOutcome59",
"modelOutcome60", "modelOutcome61", "modelOutcome62", "modelOutcome63",
"modelOutcome64", "modelOutcome65", "modelOutcome66", "modelOutcome67",
"modelOutcome68", "modelOutcome69", "modelOutcome70", "modelOutcome71",
"modelOutcome72", "modelOutcome73", "modelOutcome74", "modelOutcome75",
"modelOutcome76", "modelOutcome77", "modelOutcome78", "modelOutcome79",
"modelOutcome80", "modelOutcome81", "modelOutcome82", "modelOutcome83",
"modelOutcome84", "modelOutcome85", "modelOutcome86", "modelOutcome87",
"modelOutcome88", "modelOutcome89", "modelOutcome90", "modelOutcome91",
"modelOutcome92", "modelOutcome93", "modelOutcome94", "modelOutcome95",
"modelOutcome96", "modelOutcome97", "modelOutcome98", "modelOutcome99",
"modelOutcome100"))
> levels <- c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
> levelLength <- length(sumDF) * levels
> levelLength
[1] 1000 5000 10000 20000 30000 40000 50000 60000 70000 80000 90000 95000 99000
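My problem is that I get how long each subset of the data.frame should be, but I do not get the number of "winners" and "losers" in it. In this one-dimensional data.frame, a value greater than 0 is a winner, and a value less than or equal to 0 is a loser.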
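To illustrate: my data.frame has length 100000. At the 1% level, the subset has length 1000. Suppose that, of these 1000 elements, 800 are above 0 and 200 are below or equal to 0. How do I get the 800 and the 200?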
I tried the following:
countWin <- length(sumDF[1:levelLength > 0])
Warning message:
In 1:levelLength : numerical expression has 13 elements: only the first used
Any suggestions on how to get a certain number of elements from my vector?
Thanks for your replies.
UPDATE
Example:
My data.frame sumDF looks like this:
> sumDF[1:3]
modelOutcome1 modelOutcome2 modelOutcome3
3000 2000 5000
My data.frame sumDF has length 100000.
I want to subset sumDF using the following level lengths.
> levelLength
[1] 1000 5000 10000 20000 30000 40000 50000 60000 70000 80000 90000 95000 99000
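So for levelLength 1000, I want to subset sumDF from element 1 to element 1000.
Within that subset, I want to count all values > 0 (my winners) and all values <= 0 (my losers).
My final data.frame should look like this: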
"levels" "winners" "losers"
0.01 900 100
0.05 2400 2600
0.10 6000 4000
0.20 . .
0.30 . .
0.40
0.50
0.60
0.70
0.80
0.90
0.95
0.99
Answer 0 (score: 1)
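The dput output is a vector, not a data.frame. To count the values less than 0, take the sum of the logical comparison (to include the zeros, as in the definition of losers, the comparison would be sumDF <= 0):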
sum(sumDF<0)
#[1] 3
We can also use table to get the frequencies of losers and winners:
table(sumDF <0)
#FALSE TRUE
# 97 3
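To restrict the count to the first n elements, one option is to subset before comparing. The following is a minimal sketch, assuming the first 20 values from the dput output stand in for sumDF; the names x, n, first_n, winners and losers are illustrative and not part of the original code.

```r
# First 20 values of the sample shown in the question (stand-in for sumDF)
x <- c(3000, 2000, 5000, 4000, 1000, 4000, 0, 3000, 4000, 2000,
       2000, 3000, 1000, -3000, 2000, 0, 4000, 1000, 1000, 2000)

n <- 10                      # hypothetical subset length
first_n <- head(x, n)        # elements 1..n

winners <- sum(first_n > 0)  # values strictly greater than 0
losers  <- sum(first_n <= 0) # values less than or equal to 0
```

Because the two conditions are complementary, winners + losers always equals n as long as the data contain no NA values.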
If we have a data.frame or a matrix:
colSums(sumDF <0)
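As a small illustration of the data.frame case, the comparison returns a logical matrix and colSums counts the TRUE values per column. The data.frame df below is made up for this sketch and does not come from the question.

```r
# Hypothetical two-column data.frame, for illustration only
df <- data.frame(a = c(1, -2, 0), b = c(-1, -3, 4))

# df < 0 is a logical matrix; colSums counts the TRUE entries in each column
neg_counts <- colSums(df < 0)
```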
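I am not sure I understand the latest edit. Perhaps we want the frequencies of 'sumDF' after cutting the object into different bins. With cut, we can get these groups by specifying breaks.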
levellength <- c(1, 5, seq(10, 90, by=10), 95, 99)
tbl <- table(cut(sumDF, breaks=levellength), sumDF)
If we need the cumulative sum for each group, loop over the columns of 'tbl' with apply and use cumsum.
tbl1 <- apply(tbl, 2, cumsum)
The labels can be changed through rownames: sub matches the number that follows the opening parenthesis "(" and replaces it with 1.
rownames(tbl1) <- sub('(?<=\\()\\d+', '1', rownames(tbl1), perl=TRUE)
tbl1
# sumDF
# -3000 -1000 0 1000 2000 3000 4000 5000
#(1,5] 0 0 0 0 0 0 0 0
#(1,10] 0 0 0 0 0 0 0 0
#(1,20] 0 0 0 0 0 0 0 0
#(1,30] 0 0 0 0 0 0 0 0
#(1,40] 0 0 0 0 0 0 0 0
#(1,50] 0 0 0 0 0 0 0 0
#(1,60] 0 0 0 0 0 0 0 0
#(1,70] 0 0 0 0 0 0 0 0
#(1,80] 0 0 0 0 0 0 0 0
#(1,90] 0 0 0 0 0 0 0 0
#(1,95] 0 0 0 0 0 0 0 0
#(1,99] 0 0 0 0 0 0 0 0
Note: with the example input, the frequencies are all 0.
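The zeros arise because cut is binning the values of sumDF, which are in the thousands, rather than the element positions. If the intent is to group by position, one possible variant is to cut seq_along of the vector instead. This is only a sketch under that assumption; x, bins, tbl_pos and cum_tbl are illustrative names, and the breaks are chosen to fit a 20-element sample.

```r
# First 20 values of the sample shown in the question (stand-in for sumDF)
x <- c(3000, 2000, 5000, 4000, 1000, 4000, 0, 3000, 4000, 2000,
       2000, 3000, 1000, -3000, 2000, 0, 4000, 1000, 1000, 2000)

# Bin the positions 1..20 rather than the values
bins <- cut(seq_along(x), breaks = c(0, 5, 10, 20))

# Cross-tabulate position bin against "is a winner" (value > 0)
tbl_pos <- table(bins, winner = x > 0)

# Cumulative counts down each column: column "TRUE" holds winners,
# column "FALSE" holds losers, for the first 5, 10 and 20 elements
cum_tbl <- apply(tbl_pos, 2, cumsum)
```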
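We can also change the labels inside cut by using the labels argument. We create a custom label ('lvls') and use it in cut. Apart from that, the code below is similar to the code above.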
lvls <- paste0('(', '1,', c(5,seq(10,90, by=10), 95, 99), ']')
tbl <- table(sumDF, cut(sumDF, breaks=levellength, labels=lvls))
apply(tbl, 1, cumsum)
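For the table of levels, winners and losers requested in the update, a possible alternative is to take a running count of winners and index it at each subset length. The warning in the question arises because the whole levelLength vector is passed to the colon operator, which uses only its first element. The sketch below assumes a 20-element sample and hypothetical levels of 0.25, 0.5 and 1 so that the result is easy to check by hand; n and result are illustrative names.

```r
# First 20 values of the sample shown in the question (stand-in for sumDF)
x <- c(3000, 2000, 5000, 4000, 1000, 4000, 0, 3000, 4000, 2000,
       2000, 3000, 1000, -3000, 2000, 0, 4000, 1000, 1000, 2000)

levels <- c(0.25, 0.5, 1)      # hypothetical levels for the small sample
n <- round(length(x) * levels) # subset lengths: 5, 10, 20

# Running count of winners; position k holds the number of values > 0
# among elements 1..k. unname() drops element names such as "modelOutcome1".
winners <- cumsum(unname(x > 0))[n]

result <- data.frame(levels  = levels,
                     winners = winners,
                     losers  = n - winners)
```

With the levels from the question and the full vector, the same indexing would apply unchanged, since cumsum is computed once and then read off at each length.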