Creating an identifier variable to identify observations

Time: 2019-06-25 17:47:02

Tags: r dplyr lm broom

Hello, I have a dataset named Sample:

Sample 
# A tibble: 221,088 x 7
   gvkey  two_digit_sic fyear    part1   part2   part3 part4
   <chr>  <chr>         <dbl>    <dbl>   <dbl>   <dbl> <dbl>
 1 001003 57             1987  0.0317  0.0686   0.0380 0.157
 2 001003 57             1988 -0.358   0.0623  -0.338  0.162
 3 001003 57             1989 -0.155   0.0614  -0.784  0.140
 4 001004 50             1988  0.0868  0.00351  0.108  0.300
 5 001004 50             1989  0.0176  0.00281  0.113  0.296
 6 001004 50             1990 -0.0569  0.00257  0.0618 0.291
 7 001004 50             1991  0.00317 0.00263 -0.112  0.314
 8 001004 50             1992 -0.0418  0.00253 -0.0479 0.300
 9 001004 50             1993  0.00763 0.00274  0.0216 0.334
10 001004 50             1994 -0.0115  0.00239  0.0459 0.307
# ... with 221,078 more rows
count(Sample, gvkey)

# A tibble: 23,978 x 2
   gvkey      n
   <chr>  <int>
 1 001003     3
 2 001004    30
 3 001009     7
 4 001010    16
 5 001011     7
 6 001012     2
 7 001013    23
 8 001014     5
 9 001017     8
10 001019    14
# ... with 23,968 more rows

count(Sample, two_digit_sic)

# A tibble: 73 x 2
   two_digit_sic     n
   <chr>         <int>
 1 01              527
 2 02              111
 3 07              105
 4 08              120
 5 09               24
 6 10             8860
 7 12              477
 8 13            11200
 9 14              811
10 15              858
# ... with 63 more rows

I then run the following model:

library(dplyr)
library(broom)
mjones_1991 <- Sample %>% 
  group_by(two_digit_sic, fyear) %>% 
  filter(n() >= 10) %>% 
  do(augment(lm(part1 ~ part2 + part3 + part4, data = .))) %>% 
  ungroup()

and I get the following result:

mjones_1991

# A tibble: 219,587 x 13
   two_digit_sic fyear    part1   part2   part3 part4  .fitted .se.fit  .resid
   <chr>         <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
 1 01             1988 -0.0478  2.36e-2  0.147  1.01  -0.119    0.0576  0.0714 
 2 01             1988 -0.174   4.29e-2  0.327  0.810  0.00104  0.0560 -0.175  
 3 01             1988  0.0250  6.15e-4  0.422  0.619  0.0534   0.0711 -0.0284 
 4 01             1988 -0.0974  2.55e-2 -0.0134 0.292 -0.0847   0.0586 -0.0127 
 5 01             1988 -0.142   1.15e-3  0.0233 0.677 -0.137    0.0489 -0.0058
 6 01             1988 -0.479   2.46e-1 -0.0552 0.538 -0.0393   0.0635 -0.439  
 7 01             1988  0.00861 2.78e-1  0.251  1.58  -0.0407   0.122   0.0493 
 8 01             1988 -0.154   2.94e-2 -0.348  0.619 -0.284    0.0984  0.131  
 9 01             1988 -0.0526  8.96e-4  0.172  0.602 -0.0580   0.0452  0.0053
10 01             1988 -0.0574  2.15e-2  0.0535 0.316 -0.0596   0.0540  0.0021
# ... with 219,577 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
#   .cooksd <dbl>, .std.resid <dbl>

The problem is that gvkey is lost, so I cannot tell which gvkey each .fitted, .se.fit, or .resid value belongs to.

Here is Sample filtered to two_digit_sic == "01" and fyear == 1988:

# A tibble: 18 x 7
   gvkey  two_digit_sic fyear    part1    part2   part3  part4
   <chr>  <chr>         <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
 1 001266 01             1988 -0.0478  0.0236    0.147  1.01  
 2 002249 01             1988 -0.174   0.0429    0.327  0.810 
 3 002812 01             1988  0.0250  0.000615  0.422  0.619 
 4 003702 01             1988 -0.0974  0.0255   -0.0134 0.292 
 5 008596 01             1988 -0.142   0.00115   0.0233 0.677 
 6 009062 01             1988 -0.479   0.246    -0.0552 0.538 
 7 009391 01             1988  0.00861 0.278     0.251  1.58  
 8 010390 01             1988 -0.154   0.0294   -0.348  0.619 
 9 010884 01             1988 -0.0526  0.000896  0.172  0.602 
10 012349 01             1988 -0.0574  0.0215    0.0535 0.316 
11 012750 01             1988  0.0577  0.0157    0.0794 0.422 
12 013155 01             1988  0.117   0.124     0.370  0.829 
13 013462 01             1988  0.255   0.0828    0.529  0.270 
14 013468 01             1988 -0.0774  0.0445    0.129  0.191 
15 013550 01             1988 -0.0219  0.0204    0.0375 0.879 
16 013743 01             1988 -0.0911  0.228     0.0870 0.739 
17 014400 01             1988  0.415   0.546     0.0710 0.0437
18 014881 01             1988 -0.134   0.00380   0.0211 0.666

As you can see, there are 18 observations for two_digit_sic == "01" and fyear == 1988.

The mjones_1991 dataset contains the same observations, but the identifier (gvkey) is gone. Do you know how I can keep gvkey?
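
One possible approach, offered as a sketch rather than a confirmed answer: by default, broom's augment() builds its output from the model frame, which holds only the variables named in the formula, so gvkey never reaches the result. Passing the group's data through augment()'s `data` argument should carry the remaining columns along. The `toy` tibble and the `with_id` name below are hypothetical stand-ins with made-up values; only the column names come from the question.

```r
library(dplyr)
library(broom)

# Hypothetical stand-in for Sample: same column names, random values.
set.seed(1)
toy <- tibble(
  gvkey         = sprintf("%06d", 1:24),
  two_digit_sic = rep(c("01", "02"), each = 12),
  fyear         = 1988,
  part1         = rnorm(24),
  part2         = runif(24),
  part3         = rnorm(24),
  part4         = runif(24)
)

with_id <- toy %>%
  group_by(two_digit_sic, fyear) %>%
  filter(n() >= 10) %>%
  # Same model as in the question; the extra `data = .` asks augment()
  # to append the fitted values to the full group data, gvkey included.
  do(augment(lm(part1 ~ part2 + part3 + part4, data = .), data = .)) %>%
  ungroup()

# gvkey should now appear next to .fitted and .resid
names(with_id)
```

This assumes each group has no missing values in the model variables; if lm() drops rows because of NAs, the fitted values may not line up with the supplied data, in which case fitting with `na.action = na.exclude` is worth trying.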

0 answers:

No answers yet