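I have the following dataset and want to create a dummy variable named "specialized". To create it, I need to group the data with group_by(sic, year).
Within a given year and sic, an observation is coded 1 if its "percentage" value is the highest and the difference between the highest and second-highest percentage is greater than 10; otherwise it is coded 0.
Note, however, that if a given year and sic has no second-highest percentage, meaning the group contains only one distinct percentage value, the observation is coded 1. In my dataset this is the case for sic == 0100 in year == 2000.
I tried the following code: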
df <- df %>%
group_by(sic, year) %>%
mutate(SPECIALIZED = ifelse(max(percentage) && (max(percentage)-nth(sort(percentage), - 2)) > 10), 1, 0 ) %>%
ungroup()
But it does not work.
Here is the data:
gvkey auditor_fkey year sic percentage
1 001266 4 2001 0100 26.9605909
2 003107 2 2000 1000 37.0939127
3 003107 2 2000 1000 37.0939127
4 003107 2 2001 1000 9.8899690
5 003107 2 2001 1000 9.8899690
6 005560 1 2000 1040 100.0000000
7 005560 7 2001 1040 8.2959428
8 007881 5 2001 1040 71.1026743
9 009728 597 2001 1040 1.0906007
10 009728 597 2001 1040 1.0906007
11 010390 2 2000 0100 100.0000000
12 010390 2 2000 0100 100.0000000
13 010390 2 2001 0100 73.0394091
14 010390 2 2001 0100 73.0394091
15 012321 1 2001 1040 18.1873703
16 012321 1 2001 1040 18.1873703
17 014590 5 2000 1000 60.6862904
18 014590 5 2000 1000 60.6862904
19 014590 5 2001 1000 18.8287898
20 014590 5 2001 1000 18.8287898
21 014793 2 2000 1220 34.7515455
22 014793 2 2000 1220 34.7515455
23 014793 2 2001 1220 58.0859392
24 014793 2 2001 1220 58.0859392
25 015274 1 2000 1220 65.2484545
26 015274 1 2000 1220 65.2484545
27 015274 1 2001 1220 41.9140608
28 015274 1 2001 1220 41.9140608
29 019565 1 2001 1000 71.1457384
30 019565 1 2001 1000 71.1457384
31 020488 1 2000 1040 100.0000000
32 020488 1 2001 1040 18.1873703
33 025776 1 2000 1000 2.2197969
34 025776 1 2001 1000 71.1457384
35 031626 2 2000 1000 37.0939127
36 031626 2 2001 1000 9.8899690
37 061811 5 2000 1000 60.6862904
38 061811 5 2001 1000 18.8287898
39 061811 5 2001 1000 18.8287898
40 064134 580 2001 1000 0.1355028
41 064134 580 2001 1000 0.1355028
42 065921 1 2000 1040 100.0000000
43 065921 1 2000 1040 100.0000000
44 065921 1 2001 1040 18.1873703
45 065921 1 2001 1040 18.1873703
46 102341 2 2001 1040 1.3234119
47 142460 2 2001 1220 58.0859392
48 142460 2 2001 1220 58.0859392
49 142460 2 2001 1220 58.0859392
The final dataset should look like this:
gvkey auditor_fkey year sic percentage specialized
1 10390 2 2000 0100 100.0000000 1
2 10390 2 2000 0100 100.0000000 1
3 3107 2 2000 1000 37.0939127 0
4 3107 2 2000 1000 37.0939127 0
5 14590 5 2000 1000 60.6862904 1
6 14590 5 2000 1000 60.6862904 1
7 25776 1 2000 1000 2.2197969 0
8 31626 2 2000 1000 37.0939127 0
9 61811 5 2000 1000 60.6862904 1
10 5560 1 2000 1040 100.0000000 1
11 20488 1 2000 1040 100.0000000 1
12 65921 1 2000 1040 100.0000000 1
13 65921 1 2000 1040 100.0000000 1
14 14793 2 2000 1220 34.7515456 0
15 14793 2 2000 1220 34.7515456 0
16 15274 1 2000 1220 65.2484544 1
17 15274 1 2000 1220 65.2484544 1
18 1266 4 2001 0100 26.9605909 0
19 10390 2 2001 0100 73.0394091 1
20 10390 2 2001 0100 73.0394091 1
21 3107 2 2001 1000 9.8899690 0
22 3107 2 2001 1000 9.8899690 0
23 14590 5 2001 1000 18.8287898 0
24 14590 5 2001 1000 18.8287898 0
25 19565 1 2001 1000 71.1457384 1
26 19565 1 2001 1000 71.1457384 1
27 25776 1 2001 1000 71.1457384 1
28 31626 2 2001 1000 9.8899690 0
29 61811 5 2001 1000 18.8287898 0
30 61811 5 2001 1000 18.8287898 0
31 64134 580 2001 1000 0.1355028 0
32 64134 580 2001 1000 0.1355028 0
33 5560 7 2001 1040 8.2959428 0
34 7881 5 2001 1040 71.1026743 1
35 9728 597 2001 1040 1.0906007 0
36 9728 597 2001 1040 1.0906007 0
37 12321 1 2001 1040 18.1873703 0
38 12321 1 2001 1040 18.1873703 0
39 20488 1 2001 1040 18.1873703 0
40 65921 1 2001 1040 18.1873703 0
41 65921 1 2001 1040 18.1873703 0
42 102341 2 2001 1040 1.3234119 0
43 14793 2 2001 1220 58.0859392 1
44 14793 2 2001 1220 58.0859392 1
45 15274 1 2001 1220 41.9140608 0
46 15274 1 2001 1220 41.9140608 0
47 142460 2 2001 1220 58.0859392 1
48 142460 2 2001 1220 58.0859392 1
49 142460 2 2001 1220 58.0859392 1
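Thank you for your help.
Answer 0 (score: 0)
The row order in the question's data differs from the order in the expected result, so I took the data from the expected result instead. The code below breaks the logic into separate columns before creating the dummy variable with dummy() from hablar.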
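library(hablar)
library(dplyr)
df %>%
  group_by(sic, year) %>%
  mutate(
    # second-highest distinct percentage in the group (NA if there is only one)
    second_highest = nth(sort(unique(percentage), decreasing = TRUE), 2),
    max_value = max(percentage),
    is_max = percentage == max_value,
    # NA when the group has no second-highest value
    is_ab_10 = (max_value - second_highest) > 10,
    # dummy() maps TRUE/FALSE to 1/0; missing = 1 codes the NA case as 1
    specialized = dummy(is_max & is_ab_10, missing = 1)
  ) %>%
  ungroup() %>%
  select(-c(second_highest, max_value, is_max, is_ab_10))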
Result
# A tibble: 49 x 6
gvkey auditor_fkey year sic percentage specialized
<int> <int> <int> <int> <dbl> <int>
1 10390 2 2000 100 100 1
2 10390 2 2000 100 100 1
3 3107 2 2000 1000 37.1 0
4 3107 2 2000 1000 37.1 0
5 14590 5 2000 1000 60.7 1
6 14590 5 2000 1000 60.7 1
7 25776 1 2000 1000 2.22 0
8 31626 2 2000 1000 37.1 0
9 61811 5 2000 1000 60.7 1
10 5560 1 2000 1040 100 1
# … with 39 more rows
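If adding hablar as a dependency is not desirable, the same rule can probably be expressed with dplyr alone. The following is a minimal sketch, not the approach used in the answer above: the data frame toy and its values are made up for illustration, and the NA case (no second-highest value) is handled explicitly with is.na() instead of dummy(missing = 1).

```r
library(dplyr)

# Hypothetical toy data covering three cases:
#   2000: a single distinct value, 2001: a gap > 10, 2002: a gap <= 10
toy <- tibble(
  year       = c(2000, 2000, 2001, 2001, 2001, 2002, 2002),
  sic        = c(100, 100, 100, 100, 100, 100, 100),
  percentage = c(100, 100, 73, 73, 27, 52, 48)
)

toy_flagged <- toy %>%
  group_by(sic, year) %>%
  mutate(
    # Second-highest distinct value; NA when the group has only one distinct value
    second_highest = sort(unique(percentage), decreasing = TRUE)[2],
    # 1 if the row holds the group maximum and either there is no runner-up
    # or the maximum leads the runner-up by more than 10
    specialized = as.integer(
      percentage == max(percentage) &
        (is.na(second_highest) | max(percentage) - second_highest > 10)
    )
  ) %>%
  ungroup() %>%
  select(-second_highest)

toy_flagged$specialized
```

Here the 2000 group is coded 1 because it has no runner-up, the 2001 group is coded 1 only for the rows at 73, and the 2002 group is coded 0 throughout because the gap is 4.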
Data
df <- structure(list(gvkey = c(10390L, 10390L, 3107L, 3107L, 14590L,
14590L, 25776L, 31626L, 61811L, 5560L, 20488L, 65921L, 65921L,
14793L, 14793L, 15274L, 15274L, 1266L, 10390L, 10390L, 3107L,
3107L, 14590L, 14590L, 19565L, 19565L, 25776L, 31626L, 61811L,
61811L, 64134L, 64134L, 5560L, 7881L, 9728L, 9728L, 12321L, 12321L,
20488L, 65921L, 65921L, 102341L, 14793L, 14793L, 15274L, 15274L,
142460L, 142460L, 142460L), auditor_fkey = c(2L, 2L, 2L, 2L,
5L, 5L, 1L, 2L, 5L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 4L, 2L, 2L,
2L, 2L, 5L, 5L, 1L, 1L, 1L, 2L, 5L, 5L, 580L, 580L, 7L, 5L, 597L,
597L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L,
2001L, 2001L, 2001L), sic = c(100L, 100L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1040L, 1040L, 1040L, 1040L, 1220L,
1220L, 1220L, 1220L, 100L, 100L, 100L, 1000L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1040L,
1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L,
1220L, 1220L, 1220L, 1220L, 1220L, 1220L, 1220L), percentage = c(100,
100, 37.0939127, 37.0939127, 60.6862904, 60.6862904, 2.2197969,
37.0939127, 60.6862904, 100, 100, 100, 100, 34.7515456, 34.7515456,
65.2484544, 65.2484544, 26.9605909, 73.0394091, 73.0394091, 9.889969,
9.889969, 18.8287898, 18.8287898, 71.1457384, 71.1457384, 71.1457384,
9.889969, 18.8287898, 18.8287898, 0.1355028, 0.1355028, 8.2959428,
71.1026743, 1.0906007, 1.0906007, 18.1873703, 18.1873703, 18.1873703,
18.1873703, 18.1873703, 1.3234119, 58.0859392, 58.0859392, 41.9140608,
41.9140608, 58.0859392, 58.0859392, 58.0859392)), row.names = c(NA,
-49L), class = c("tbl_df",
"tbl", "data.frame"))
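As a side note on why the attempt in the question fails, there appear to be two separate problems. First, the parenthesis after "> 10" closes ifelse(), so ifelse() receives only its test argument and the 1 and 0 are passed to mutate() instead. Second, even with the parentheses fixed, nth(sort(percentage), -2) picks the second-to-last element rather than the second-highest distinct value, which matters when rows are duplicated or a group has a single row. The short sketch below illustrates the second point with made-up vectors:

```r
library(dplyr)

# A group whose maximum is repeated: the "second-highest" element is the
# duplicate of the maximum, so the computed gap would be 0, not NA
dup_group <- c(100, 100)
second_dup <- nth(sort(dup_group), -2)

# A single-row group: there is no second element, so nth() returns NA
single_group <- 100
second_single <- nth(sort(single_group), -2)

# Taking unique() first gives NA in both cases, which can then be coded as 1
second_distinct <- nth(sort(unique(dup_group), decreasing = TRUE), 2)
```

This is why the answer applies unique() before sorting and treats the resulting NA as the "no second-highest value" case.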