Double sorting in a data.frame

Time: 2016-09-30 02:16:28

Tags: r dataframe

I have a data.frame that needs to be double sorted. Part of it is shown below:

       PERMNO   mktcap      coef
13974  11711  37.5000 1.2508261
24798  13071  72.8750 0.7413084
15294  11869  75.0000 0.3820783
33114  14170 111.3750 2.3607454
24270  13004 131.2500 4.0205943
37866  14699 131.2500 1.8548012
32190  13995 135.0000 1.7028044
30078  13768 149.2500 1.3376186
28494  13530 150.0000 1.7675992
27966  13469 188.1250 1.3499105
16350  12001 210.3750 1.7627097
30870  13848 225.0000 1.7692176
29154  13581 272.2500 1.6714913
33906  14277 309.3750 2.0797843
39186  14816 322.8750 1.6204331
7638   10911 332.7500 1.0864174
9882   11201 339.0000 1.8405390
38922  14787 363.1250 0.9696966
40638  15018 376.8750 1.5077336
34302  14306 411.7500 1.4610924
8298   11017 453.1250 2.0834445
40770  15034 528.9375 2.3746428
33774  14269 531.3750 2.0195085
32322  14007 560.6250 1.7508435
45258  15560 572.0625 2.2281513
10806  11332 577.5000 1.3420006
30342  13784 593.1250 2.0868992
22026  12722 596.7500 1.2661233
12918  11535 640.0000 2.3642444
43014  15253 641.2500 0.8406199

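I need to divide all observations into 100 groups. The procedure has two steps: first, divide all observations into 10 groups with an equal number of observations, according to the value of the variable mktcap; second, within each of those groups, further divide the observations into 10 groups with an equal number of observations, according to the value of the variable coef. Then add a new variable indicating the group of each observation. The values of this new variable should run from p1 to p100.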

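1 answer:

Answer 0: (score: 0)

With the data provided, we can try a smaller number of groups first and then extend the logic to the full case. We start with 5 groups: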

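# Number of groups
ngrps <- 5
# Cumulative sum of mktcap (df1 is already sorted by mktcap)
csum <- cumsum(df1$mktcap)
# Cut the cumulative sum into ngrps equal-width intervals;
# length.out makes the last break exactly max(csum)
lbl <- cut(csum, seq(0, max(csum), length.out = ngrps + 1),
           labels = paste0("p", seq_len(ngrps)))
# Attach the labels to the data
cbind(df1, lbl)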
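It may help to see what this kind of split balances. The sketch below uses a small made-up vector `x` (not the question's data) and suggests that cutting the cumulative sum gives groups with roughly equal totals, which is not the same as an equal number of observations per group.

```r
# Toy values, sorted ascending (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 5, 5)
ngrps <- 2

csum <- cumsum(x)  # 1 3 6 10 15 20
lbl <- cut(csum, seq(0, max(csum), length.out = ngrps + 1),
           labels = paste0("p", seq_len(ngrps)))

# Observations per group: p1 = 4, p2 = 2
table(lbl)
# Total of x per group: p1 = 10, p2 = 10
tapply(x, lbl, sum)
```

If the goal is an equal count of observations in each group, a rank-based split is probably a better fit than a cumulative-sum split.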

Now that we see this works on the data, we can wrap it in a function:

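# Label each element of vec by which of `size` equal slices of the
# cumulative sum it falls into
part <- function(vec, size) {
  csum <- cumsum(vec)
  # length.out makes the last break exactly max(csum)
  lbl <- cut(csum, breaks = seq(0, max(csum), length.out = size + 1),
             labels = paste0("p", seq_len(size)))
  return(lbl)
}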
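The question asks for groups with an equal number of observations. One possible variant, sketched here under that assumption, assigns groups from ranks instead of from the cumulative sum. The name `part_n` is hypothetical, and unlike `part` it returns a character vector and does not require the input to be sorted.

```r
# Hypothetical helper: split vec into `size` groups of (nearly) equal count,
# ordered by value; ties are broken by position
part_n <- function(vec, size) {
  r <- rank(vec, ties.method = "first")            # ranks 1..n
  paste0("p", ceiling(r * size / length(vec)))     # map ranks to 1..size
}

part_n(c(5, 1, 4, 2, 3, 6), 3)
# "p3" "p1" "p2" "p1" "p2" "p3"
```

When the length of the vector is not a multiple of `size`, the group sizes differ by at most one.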

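With this new part function, we can pass in any numeric vector and a number of groups, and it will be split into that many groups. Now create a larger data set: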

df2 <- df1[sample(1:nrow(df1), 1000, TRUE),]

Now we have a data set large enough to split into 100 groups:

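library(dplyr)
library(data.table)  # loaded only for rleid()
df2 %>% mutate(grp1 = part(mktcap, 10)) %>%          # first-level groups from mktcap
  group_by(grp1) %>%
  mutate(grp2 = part(coef, 10)) %>%                  # second-level groups from coef, within each grp1
  mutate(grp3 = paste0("p", rleid(grp1, grp2))) %>%  # one label per run of (grp1, grp2)
  select(-grp1, -grp2)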
# Source: local data frame [1,000 x 5]
# Groups: grp1 [10]
# 
#      grp1 PERMNO  mktcap     coef  grp3
#    <fctr>  <int>   <dbl>    <dbl> <chr>
# 1      p1  14269 531.375 2.019509    p1
# 2      p1  13469 188.125 1.349911    p1
# 3      p1  14007 560.625 1.750843    p1
# 4      p1  14007 560.625 1.750843    p1
# 5      p1  13469 188.125 1.349911    p1
# 6      p1  13530 150.000 1.767599    p1
# 7      p1  14007 560.625 1.750843    p1
# 8      p1  14277 309.375 2.079784    p1
# 9      p1  14007 560.625 1.750843    p1
# 10     p1  14170 111.375 2.360745    p2
# # ... with 990 more rows
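For comparison, here is a minimal sketch of an equal-count double sort that yields labels p1 to p100, as the question describes. It swaps in `dplyr::ntile` for the cumulative-sum split and runs on simulated data; `toy`, `g1`, `g2` and `grp` are names invented for this illustration, and the simulated values are not the question's data.

```r
library(dplyr)

# Simulated stand-in data (assumption: 1000 rows, no missing values)
set.seed(1)
toy <- data.frame(mktcap = runif(1000, 30, 650),
                  coef   = rnorm(1000, 1.7, 0.6))

res <- toy %>%
  mutate(g1 = ntile(mktcap, 10)) %>%   # 10 equal-count groups by mktcap
  group_by(g1) %>%
  mutate(g2 = ntile(coef, 10)) %>%     # 10 equal-count groups by coef within each g1
  ungroup() %>%                        # drop grouping before building the final label
  mutate(grp = paste0("p", (g1 - 1) * 10 + g2)) %>%
  select(-g1, -g2)

# Number of distinct groups and the sizes they take
length(unique(res$grp))
table(table(res$grp))
```

Because the label is computed arithmetically from the two bucket numbers, it does not depend on row order, so no prior sorting is needed.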

Data

df1 <- structure(list(PERMNO = c(11711L, 13071L, 11869L, 14170L, 13004L, 
14699L, 13995L, 13768L, 13530L, 13469L, 12001L, 13848L, 13581L, 
14277L, 14816L, 10911L, 11201L, 14787L, 15018L, 14306L, 11017L, 
15034L, 14269L, 14007L, 15560L, 11332L, 13784L, 12722L, 11535L, 
15253L), mktcap = c(37.5, 72.875, 75, 111.375, 131.25, 131.25, 
135, 149.25, 150, 188.125, 210.375, 225, 272.25, 309.375, 322.875, 
332.75, 339, 363.125, 376.875, 411.75, 453.125, 528.9375, 531.375, 
560.625, 572.0625, 577.5, 593.125, 596.75, 640, 641.25), coef = c(1.2508261, 
0.7413084, 0.3820783, 2.3607454, 4.0205943, 1.8548012, 1.7028044, 
1.3376186, 1.7675992, 1.3499105, 1.7627097, 1.7692176, 1.6714913, 
2.0797843, 1.6204331, 1.0864174, 1.840539, 0.9696966, 1.5077336, 
1.4610924, 2.0834445, 2.3746428, 2.0195085, 1.7508435, 2.2281513, 
1.3420006, 2.0868992, 1.2661233, 2.3642444, 0.8406199)), .Names = c("PERMNO", 
"mktcap", "coef"), class = "data.frame", row.names = c("13974", 
"24798", "15294", "33114", "24270", "37866", "32190", "30078", 
"28494", "27966", "16350", "30870", "29154", "33906", "39186", 
"7638", "9882", "38922", "40638", "34302", "8298", "40770", "33774", 
"32322", "45258", "10806", "30342", "22026", "12918", "43014"
))