Hello, I have a dataset named "Sample":
Sample
# A tibble: 221,088 x 7
gvkey two_digit_sic fyear part1 part2 part3 part4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001003 57 1987 0.0317 0.0686 0.0380 0.157
2 001003 57 1988 -0.358 0.0623 -0.338 0.162
3 001003 57 1989 -0.155 0.0614 -0.784 0.140
4 001004 50 1988 0.0868 0.00351 0.108 0.300
5 001004 50 1989 0.0176 0.00281 0.113 0.296
6 001004 50 1990 -0.0569 0.00257 0.0618 0.291
7 001004 50 1991 0.00317 0.00263 -0.112 0.314
8 001004 50 1992 -0.0418 0.00253 -0.0479 0.300
9 001004 50 1993 0.00763 0.00274 0.0216 0.334
10 001004 50 1994 -0.0115 0.00239 0.0459 0.307
# ... with 221,078 more rows
count(Sample, gvkey)
# A tibble: 23,978 x 2
gvkey n
<chr> <int>
1 001003 3
2 001004 30
3 001009 7
4 001010 16
5 001011 7
6 001012 2
7 001013 23
8 001014 5
9 001017 8
10 001019 14
# ... with 23,968 more rows
count(Sample, two_digit_sic)
# A tibble: 73 x 2
two_digit_sic n
<chr> <int>
1 01 527
2 02 111
3 07 105
4 08 120
5 09 24
6 10 8860
7 12 477
8 13 11200
9 14 811
10 15 858
# ... with 63 more rows
Then I run the following model:
library(dplyr)
library(broom)
mjones_1991 <- Sample %>%
  group_by(two_digit_sic, fyear) %>%
  filter(n() >= 10) %>%
  do(augment(lm(part1 ~ part2 + part3 + part4, data = .))) %>%
  ungroup()
and I get the following result:
mjones_1991
# A tibble: 219,587 x 13
two_digit_sic fyear part1 part2 part3 part4 .fitted .se.fit .resid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 01 1988 -0.0478 2.36e-2 0.147 1.01 -0.119 0.0576 0.0714
2 01 1988 -0.174 4.29e-2 0.327 0.810 0.00104 0.0560 -0.175
3 01 1988 0.0250 6.15e-4 0.422 0.619 0.0534 0.0711 -0.0284
4 01 1988 -0.0974 2.55e-2 -0.0134 0.292 -0.0847 0.0586 -0.0127
5 01 1988 -0.142 1.15e-3 0.0233 0.677 -0.137 0.0489 -0.0058
6 01 1988 -0.479 2.46e-1 -0.0552 0.538 -0.0393 0.0635 -0.439
7 01 1988 0.00861 2.78e-1 0.251 1.58 -0.0407 0.122 0.0493
8 01 1988 -0.154 2.94e-2 -0.348 0.619 -0.284 0.0984 0.131
9 01 1988 -0.0526 8.96e-4 0.172 0.602 -0.0580 0.0452 0.0053
10 01 1988 -0.0574 2.15e-2 0.0535 0.316 -0.0596 0.0540 0.0021
# ... with 219,577 more rows, and 4 more variables: .hat <dbl>, .sigma <dbl>,
# .cooksd <dbl>, .std.resid <dbl>
The problem is that I have lost gvkey, so I cannot tell which gvkey each .fitted, .se.fit, or .resid value belongs to.
Here is Sample filtered to two_digit_sic == "01" and fyear == 1988:
# A tibble: 18 x 7
gvkey two_digit_sic fyear part1 part2 part3 part4
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001266 01 1988 -0.0478 0.0236 0.147 1.01
2 002249 01 1988 -0.174 0.0429 0.327 0.810
3 002812 01 1988 0.0250 0.000615 0.422 0.619
4 003702 01 1988 -0.0974 0.0255 -0.0134 0.292
5 008596 01 1988 -0.142 0.00115 0.0233 0.677
6 009062 01 1988 -0.479 0.246 -0.0552 0.538
7 009391 01 1988 0.00861 0.278 0.251 1.58
8 010390 01 1988 -0.154 0.0294 -0.348 0.619
9 010884 01 1988 -0.0526 0.000896 0.172 0.602
10 012349 01 1988 -0.0574 0.0215 0.0535 0.316
11 012750 01 1988 0.0577 0.0157 0.0794 0.422
12 013155 01 1988 0.117 0.124 0.370 0.829
13 013462 01 1988 0.255 0.0828 0.529 0.270
14 013468 01 1988 -0.0774 0.0445 0.129 0.191
15 013550 01 1988 -0.0219 0.0204 0.0375 0.879
16 013743 01 1988 -0.0911 0.228 0.0870 0.739
17 014400 01 1988 0.415 0.546 0.0710 0.0437
18 014881 01 1988 -0.134 0.00380 0.0211 0.666
As you can see, I have 18 observations for two_digit_sic == "01" and fyear == 1988.
The mjones_1991 dataset contains the same observations, but the identifier (gvkey) is gone. Do you know how I can keep gvkey?
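
One possible direction, offered as a sketch rather than a confirmed fix: `augment()` rebuilds its output from the model frame by default, and the model frame holds only the variables in the formula, which would explain why gvkey disappears. Passing the group's data through the `data` argument of `augment()` should carry every original column along. The `toy` tibble and its values below are hypothetical stand-ins for Sample; only the column names match.

```r
library(dplyr)
library(broom)

# Hypothetical toy data with the same column names as Sample
set.seed(1)
toy <- tibble(
  gvkey         = sprintf("%06d", 1:24),
  two_digit_sic = rep(c("01", "02"), each = 12),
  fyear         = 1988,
  part1         = rnorm(24),
  part2         = runif(24),
  part3         = rnorm(24),
  part4         = runif(24)
)

fit_toy <- toy %>%
  group_by(two_digit_sic, fyear) %>%
  filter(n() >= 10) %>%
  # data = . hands the full group (including gvkey) to augment(),
  # so the fitted values and residuals are appended to the original rows
  do(augment(lm(part1 ~ part2 + part3 + part4, data = .), data = .)) %>%
  ungroup()

names(fit_toy)
```

If this behaves on the real data as it does on the toy data, each row of the result keeps its gvkey next to .fitted and .resid. Rows with missing values in the model variables may need separate handling.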