I have a data.frame that needs a double sort; part of it is shown below:
PERMNO mktcap coef
13974 11711 37.5000 1.2508261
24798 13071 72.8750 0.7413084
15294 11869 75.0000 0.3820783
33114 14170 111.3750 2.3607454
24270 13004 131.2500 4.0205943
37866 14699 131.2500 1.8548012
32190 13995 135.0000 1.7028044
30078 13768 149.2500 1.3376186
28494 13530 150.0000 1.7675992
27966 13469 188.1250 1.3499105
16350 12001 210.3750 1.7627097
30870 13848 225.0000 1.7692176
29154 13581 272.2500 1.6714913
33906 14277 309.3750 2.0797843
39186 14816 322.8750 1.6204331
7638 10911 332.7500 1.0864174
9882 11201 339.0000 1.8405390
38922 14787 363.1250 0.9696966
40638 15018 376.8750 1.5077336
34302 14306 411.7500 1.4610924
8298 11017 453.1250 2.0834445
40770 15034 528.9375 2.3746428
33774 14269 531.3750 2.0195085
32322 14007 560.6250 1.7508435
45258 15560 572.0625 2.2281513
10806 11332 577.5000 1.3420006
30342 13784 593.1250 2.0868992
22026 12722 596.7500 1.2661233
12918 11535 640.0000 2.3642444
43014 15253 641.2500 0.8406199
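I need to sort all observations into 100 groups. The procedure has two steps: first, sort all observations into 10 groups with the same number of observations according to the value of the variable mktcap; second, within each of those groups, further split the observations into 10 groups with the same number of observations according to the value of the variable coef. Then add a new variable indicating the group of each observation. The values of this new variable should run from p1 to p100.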
Answer 0 (score: 0)
With the data provided, we can try a smaller grouping first and then extend the logic to the full case. We start with 5 groups:
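# Number of groups
ngrps <- 5
# Target total of mktcap per group (overall sum divided by the number of groups)
val <- sum(df1$mktcap)/ngrps
# Cumulative sum of mktcap, taken in row order
csum <- cumsum(df1$mktcap)
# Cut the cumulative sum at multiples of val and label the intervals
lbl <- cut(csum, seq(0, max(csum), by=val), labels=paste0("p", seq_len(ngrps)))
# Attach the labels to the data
cbind(df1, lbl)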
Now we can see that this works on the data. We can wrap it in a function:
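# Label each element of vec with one of `size` groups by cutting the
# cumulative sum of vec into `size` equal-width intervals
part <- function(vec, size) {
  val <- sum(vec) / size
  csum <- cumsum(vec)
  lbl <- cut(csum, breaks=seq(0, max(csum), by=val),
             labels=paste0("p", 1:size))
  return(lbl)
}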
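As a quick check of what `part` produces on the sample data, the sketch below tabulates the number of rows per group. Because `part` cuts the cumulative sum, it balances the total `mktcap` per group rather than the number of rows. For comparison, `eq_bins` is a hypothetical helper (not part of this answer) that bins by rank, which is one way to get equal-count groups as the question asks.

```r
# mktcap column of the sample data, already sorted in ascending order
mktcap <- c(37.5, 72.875, 75, 111.375, 131.25, 131.25, 135, 149.25, 150,
            188.125, 210.375, 225, 272.25, 309.375, 322.875, 332.75, 339,
            363.125, 376.875, 411.75, 453.125, 528.9375, 531.375, 560.625,
            572.0625, 577.5, 593.125, 596.75, 640, 641.25)

# part() as defined in the answer
part <- function(vec, size) {
  val <- sum(vec) / size
  csum <- cumsum(vec)
  cut(csum, breaks = seq(0, max(csum), by = val),
      labels = paste0("p", 1:size))
}

# Hypothetical helper (an assumption for illustration): equal-count bins by rank
eq_bins <- function(vec, size) {
  r <- rank(vec, ties.method = "first")   # ties are broken by position
  factor(paste0("p", ceiling(r * size / length(vec))),
         levels = paste0("p", seq_len(size)))
}

sizes_part <- table(part(mktcap, 5))      # rows per group, cumulative-sum split
sizes_eq   <- table(eq_bins(mktcap, 5))   # rows per group, rank-based split
print(sizes_part)
print(sizes_eq)
```

On this data the cumulative-sum split puts 13, 6, 4, 3 and 4 rows in p1 to p5, while the rank-based split puts 6 rows in each group.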
With this new part function, we can pass in any vector and a size, and it splits the vector into that number of groups. Now create a larger data set:
df2 <- df1[sample(1:nrow(df1), 1000, TRUE),]
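One point worth checking before applying `part` to `df2`: the function accumulates values in the order they arrive, and `df2` is a random resample, so its rows are no longer sorted by `mktcap`. The sketch below uses a small made-up vector `x` (not from the question) to show one way to sort first and then map the labels back to the original row order.

```r
# part() as defined in the answer
part <- function(vec, size) {
  val <- sum(vec) / size
  csum <- cumsum(vec)
  cut(csum, breaks = seq(0, max(csum), by = val),
      labels = paste0("p", 1:size))
}

x <- c(5, 1, 4, 2, 3, 6)        # hypothetical values, deliberately unsorted
o <- order(x)                   # permutation that sorts x ascending
lbl_sorted <- part(x[o], 3)     # labels for the sorted values
lbl <- lbl_sorted[order(o)]     # labels mapped back to the original order of x
print(data.frame(x, lbl))
```

Here `order(o)` is the inverse of the sorting permutation, so each label ends up next to the value it was computed for.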
Now we have a data set large enough to split into 100 groups:
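library(dplyr)
library(data.table)  # used here only for rleid()
# First split: 10 groups on mktcap (part() bins rows in their current order)
df2 %>% mutate(grp1 = part(mktcap, 10)) %>%
  group_by(grp1) %>%
  # Second split: 10 groups on coef within each grp1
  mutate(grp2 = part(coef, 10)) %>%
  # Number each run of (grp1, grp2); the data are still grouped by grp1 here,
  # so rleid() restarts at 1 inside each grp1 group
  mutate(grp3 = paste0("p", rleid(grp1, grp2))) %>%
  select(-grp1, -grp2)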
# Source: local data frame [1,000 x 5]
# Groups: grp1 [10]
#
# grp1 PERMNO mktcap coef grp3
# <fctr> <int> <dbl> <dbl> <chr>
# 1 p1 14269 531.375 2.019509 p1
# 2 p1 13469 188.125 1.349911 p1
# 3 p1 14007 560.625 1.750843 p1
# 4 p1 14007 560.625 1.750843 p1
# 5 p1 13469 188.125 1.349911 p1
# 6 p1 13530 150.000 1.767599 p1
# 7 p1 14007 560.625 1.750843 p1
# 8 p1 14277 309.375 2.079784 p1
# 9 p1 14007 560.625 1.750843 p1
# 10 p1 14170 111.375 2.360745 p2
# # ... with 990 more rows
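If the goal is exactly the one stated in the question (ten equal-count groups on `mktcap`, each split into ten equal-count groups on `coef`, labelled p1 to p100), one possible variant swaps `part` for `dplyr::ntile` and ungroups before building the final label. This is a sketch under stated assumptions, not the method used in this answer: the simulated `df2` and the column name `grp` are made up for illustration.

```r
library(dplyr)

set.seed(42)
# Hypothetical stand-in for df2: 1000 rows of simulated mktcap and coef
df2 <- data.frame(mktcap = runif(1000, 30, 650),
                  coef   = rnorm(1000, mean = 1.7, sd = 0.6))

res <- df2 %>%
  mutate(grp1 = ntile(mktcap, 10)) %>%      # 10 equal-count bins on mktcap
  group_by(grp1) %>%
  mutate(grp2 = ntile(coef, 10)) %>%        # 10 equal-count bins on coef, per grp1
  ungroup() %>%                             # drop the grouping before combining ids
  mutate(grp = paste0("p", (grp1 - 1L) * 10L + grp2)) %>%
  select(-grp1, -grp2)

print(table(table(res$grp)))   # distribution of group sizes
```

With this numbering, p1 to p10 are the ten `coef` bins inside the lowest `mktcap` decile, p11 to p20 those inside the second decile, and so on.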
Data
df1 <- structure(list(PERMNO = c(11711L, 13071L, 11869L, 14170L, 13004L,
14699L, 13995L, 13768L, 13530L, 13469L, 12001L, 13848L, 13581L,
14277L, 14816L, 10911L, 11201L, 14787L, 15018L, 14306L, 11017L,
15034L, 14269L, 14007L, 15560L, 11332L, 13784L, 12722L, 11535L,
15253L), mktcap = c(37.5, 72.875, 75, 111.375, 131.25, 131.25,
135, 149.25, 150, 188.125, 210.375, 225, 272.25, 309.375, 322.875,
332.75, 339, 363.125, 376.875, 411.75, 453.125, 528.9375, 531.375,
560.625, 572.0625, 577.5, 593.125, 596.75, 640, 641.25), coef = c(1.2508261,
0.7413084, 0.3820783, 2.3607454, 4.0205943, 1.8548012, 1.7028044,
1.3376186, 1.7675992, 1.3499105, 1.7627097, 1.7692176, 1.6714913,
2.0797843, 1.6204331, 1.0864174, 1.840539, 0.9696966, 1.5077336,
1.4610924, 2.0834445, 2.3746428, 2.0195085, 1.7508435, 2.2281513,
1.3420006, 2.0868992, 1.2661233, 2.3642444, 0.8406199)), .Names = c("PERMNO",
"mktcap", "coef"), class = "data.frame", row.names = c("13974",
"24798", "15294", "33114", "24270", "37866", "32190", "30078",
"28494", "27966", "16350", "30870", "29154", "33906", "39186",
"7638", "9882", "38922", "40638", "34302", "8298", "40770", "33774",
"32322", "45258", "10806", "30342", "22026", "12918", "43014"
))