Creating a new variable with dplyr and ifelse using multiple conditions

Time: 2019-06-19 03:56:06

Tags: r dplyr

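I have the following data set. I want to create a variable named "specialized". To create it, I need to group the data with group_by(sic, year). The dummy variable "specialized" is then defined as follows: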

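If, within a given year and sic, a row's "percentage" is the highest and the difference between the highest and the second-highest percentage is greater than 10, the row is coded 1, otherwise 0.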

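Note, however, that if a given year and sic has no second-highest percentage, meaning there is only one percentage value and it is therefore the highest, the row is coded 1. An example in my data set is sic == 0100 in year == 2000.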
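The rule can be sketched for a single group in base R. The helper name flag_specialized and its gap argument are made up for illustration, and the sketch assumes that repeated percentages count as one level, which is what the expected output further down suggests.

```r
# Hypothetical helper (not from the original post): flags the rows that hold
# the group's top percentage when it leads the runner-up by more than `gap`.
flag_specialized <- function(percentage, gap = 10) {
  # Distinct values, largest first; duplicates are treated as one level (assumption).
  distinct_desc <- sort(unique(percentage), decreasing = TRUE)
  top <- distinct_desc[1]
  # With a single distinct value there is no runner-up, so the rule codes it as 1.
  lead_ok <- length(distinct_desc) < 2 || (top - distinct_desc[2]) > gap
  as.integer(percentage == top & lead_ok)
}

flag_specialized(c(37.09, 37.09, 60.69, 60.69, 2.22))  # top leads by more than 10
flag_specialized(c(100, 100))                          # only one distinct value
flag_specialized(c(50, 45))                            # lead of 5 is too small
```

Applied per (sic, year) group, a function like this would produce the whole column in one step.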

I tried the following code:

df <- df %>% 
  group_by(sic, year) %>% 
  mutate(SPECIALIZED = ifelse(max(percentage) && (max(percentage)-nth(sort(percentage), - 2)) > 10), 1, 0 ) %>% 
  ungroup()

But it does not work.
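The attempt appears to fail for a few reasons. The closing parenthesis of ifelse() comes before ", 1, 0", so ifelse() receives only its test argument. max(percentage) is a number, not a test of whether the row holds the maximum. And nth(sort(percentage), -2) picks the second-to-last element, which equals the maximum whenever the top value is repeated. A minimal corrected sketch using only dplyr follows; the toy data frame and the helper columns top and runner_up are made up for illustration.

```r
library(dplyr)

# Toy data, invented for this sketch (not the asker's data set).
toy <- data.frame(
  sic = c(100, 100, 1000, 1000, 1000, 1220, 1220),
  year = 2000,
  percentage = c(100, 100, 60, 45, 60, 52, 48)
)

fixed <- toy %>%
  group_by(sic, year) %>%
  mutate(
    top = max(percentage),
    # Second-highest distinct value; nth() returns NA when the group has only one.
    runner_up = nth(sort(unique(percentage), decreasing = TRUE), 2),
    SPECIALIZED = if_else(
      percentage == top & (is.na(runner_up) | top - runner_up > 10),
      1, 0
    )
  ) %>%
  ungroup() %>%
  select(-top, -runner_up)

fixed$SPECIALIZED
```

Here the sic 100 group has a single distinct value and is coded 1, the 60s in sic 1000 lead by 15 and are coded 1, and sic 1220 has a lead of only 4, so both of its rows are coded 0.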

Here is the data:

   gvkey auditor_fkey  year  sic  percentage
1  001266            4 2001 0100  26.9605909
2  003107            2 2000 1000  37.0939127
3  003107            2 2000 1000  37.0939127
4  003107            2 2001 1000   9.8899690
5  003107            2 2001 1000   9.8899690
6  005560            1 2000 1040 100.0000000
7  005560            7 2001 1040   8.2959428
8  007881            5 2001 1040  71.1026743
9  009728          597 2001 1040   1.0906007
10 009728          597 2001 1040   1.0906007
11 010390            2 2000 0100 100.0000000
12 010390            2 2000 0100 100.0000000
13 010390            2 2001 0100  73.0394091
14 010390            2 2001 0100  73.0394091
15 012321            1 2001 1040  18.1873703
16 012321            1 2001 1040  18.1873703
17 014590            5 2000 1000  60.6862904
18 014590            5 2000 1000  60.6862904
19 014590            5 2001 1000  18.8287898
20 014590            5 2001 1000  18.8287898
21 014793            2 2000 1220  34.7515455
22 014793            2 2000 1220  34.7515455
23 014793            2 2001 1220  58.0859392
24 014793            2 2001 1220  58.0859392
25 015274            1 2000 1220  65.2484545
26 015274            1 2000 1220  65.2484545
27 015274            1 2001 1220  41.9140608
28 015274            1 2001 1220  41.9140608
29 019565            1 2001 1000  71.1457384
30 019565            1 2001 1000  71.1457384
31 020488            1 2000 1040 100.0000000
32 020488            1 2001 1040  18.1873703
33 025776            1 2000 1000   2.2197969
34 025776            1 2001 1000  71.1457384
35 031626            2 2000 1000  37.0939127
36 031626            2 2001 1000   9.8899690
37 061811            5 2000 1000  60.6862904
38 061811            5 2001 1000  18.8287898
39 061811            5 2001 1000  18.8287898
40 064134          580 2001 1000   0.1355028
41 064134          580 2001 1000   0.1355028
42 065921            1 2000 1040 100.0000000
43 065921            1 2000 1040 100.0000000
44 065921            1 2001 1040  18.1873703
45 065921            1 2001 1040  18.1873703
46 102341            2 2001 1040   1.3234119
47 142460            2 2001 1220  58.0859392
48 142460            2 2001 1220  58.0859392
49 142460            2 2001 1220  58.0859392

The final data set should look like this:

    gvkey auditor_fkey year sic  percentage      specialized
1   10390            2 2000 0100 100.0000000           1
2   10390            2 2000 0100 100.0000000           1
3    3107            2 2000 1000  37.0939127           0
4    3107            2 2000 1000  37.0939127           0
5   14590            5 2000 1000  60.6862904           1
6   14590            5 2000 1000  60.6862904           1
7   25776            1 2000 1000   2.2197969           0
8   31626            2 2000 1000  37.0939127           0
9   61811            5 2000 1000  60.6862904           1
10   5560            1 2000 1040 100.0000000           1
11  20488            1 2000 1040 100.0000000           1
12  65921            1 2000 1040 100.0000000           1
13  65921            1 2000 1040 100.0000000           1
14  14793            2 2000 1220  34.7515456           0
15  14793            2 2000 1220  34.7515456           0
16  15274            1 2000 1220  65.2484544           1
17  15274            1 2000 1220  65.2484544           1
18   1266            4 2001 0100  26.9605909           0
19  10390            2 2001 0100  73.0394091           1
20  10390            2 2001 0100  73.0394091           1
21   3107            2 2001 1000   9.8899690           0
22   3107            2 2001 1000   9.8899690           0
23  14590            5 2001 1000  18.8287898           0
24  14590            5 2001 1000  18.8287898           0
25  19565            1 2001 1000  71.1457384           1
26  19565            1 2001 1000  71.1457384           1
27  25776            1 2001 1000  71.1457384           1
28  31626            2 2001 1000   9.8899690           0
29  61811            5 2001 1000  18.8287898           0
30  61811            5 2001 1000  18.8287898           0
31  64134          580 2001 1000   0.1355028           0
32  64134          580 2001 1000   0.1355028           0
33   5560            7 2001 1040   8.2959428           0
34   7881            5 2001 1040  71.1026743           1
35   9728          597 2001 1040   1.0906007           0
36   9728          597 2001 1040   1.0906007           0
37  12321            1 2001 1040  18.1873703           0
38  12321            1 2001 1040  18.1873703           0
39  20488            1 2001 1040  18.1873703           0
40  65921            1 2001 1040  18.1873703           0
41  65921            1 2001 1040  18.1873703           0
42 102341            2 2001 1040   1.3234119           0
43  14793            2 2001 1220  58.0859392           1
44  14793            2 2001 1220  58.0859392           1
45  15274            1 2001 1220  41.9140608           0
46  15274            1 2001 1220  41.9140608           0
47 142460            2 2001 1220  58.0859392           1
48 142460            2 2001 1220  58.0859392           1
49 142460            2 2001 1220  58.0859392           1

Thank you for your help.

1 answer:

Answer 0: (score: 0)

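The row order differs between the question's input data and its expected result, so I took the data from the expected result instead. The code below breaks the logic into separate columns before creating the dummy variable with dummy() from hablar.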

library(hablar)
library(dplyr)

df %>% 
  group_by(sic, year) %>% 
  mutate(
    # second-highest distinct percentage in the group (NA if there is only one)
    second_highest = nth(sort(unique(percentage), decreasing = TRUE), 2), 
    max_value = max(percentage),
    is_max   = percentage == max_value,
    is_ab_10 = (max_value - second_highest) > 10,
    # an NA condition (no second-highest value) is coded as 1
    specialized = dummy(is_max & is_ab_10, missing = 1)
  ) %>% 
  ungroup() %>% 
  select(-c(second_highest, max_value, is_max, is_ab_10))
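The missing = 1 argument matters because of how R combines logical values with NA. The base R sketch below mimics that step without hablar. It assumes dummy() maps TRUE to 1, FALSE to 0, and NA to the value of missing, which is what the result shown in this answer suggests; the vector names are made up for illustration.

```r
# One row per case: a top row in a single-value group, a top row with a clear
# lead, and a non-top row in a single-value group (illustrative values).
is_max   <- c(TRUE, TRUE, FALSE)
is_ab_10 <- c(NA, TRUE, NA)   # NA: the group has no second-highest value

cond <- is_max & is_ab_10     # TRUE & NA is NA, FALSE & NA is FALSE
cond

# Base R stand-in for dummy(cond, missing = 1), under the assumption above.
specialized <- ifelse(is.na(cond), 1L, as.integer(cond))
specialized
```

Only rows holding the group maximum can end up NA, so replacing NA with 1 implements the "no second-highest percentage" rule from the question.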

Result

# A tibble: 49 x 6
   gvkey auditor_fkey  year   sic percentage specialized
   <int>        <int> <int> <int>      <dbl>       <int>
 1 10390            2  2000   100     100              1
 2 10390            2  2000   100     100              1
 3  3107            2  2000  1000      37.1            0
 4  3107            2  2000  1000      37.1            0
 5 14590            5  2000  1000      60.7            1
 6 14590            5  2000  1000      60.7            1
 7 25776            1  2000  1000       2.22           0
 8 31626            2  2000  1000      37.1            0
 9 61811            5  2000  1000      60.7            1
10  5560            1  2000  1040     100              1
# … with 39 more rows

Data

df <- structure(list(
  gvkey = c(10390L, 10390L, 3107L, 3107L, 14590L, 
    14590L, 25776L, 31626L, 61811L, 5560L, 20488L, 65921L, 65921L, 
    14793L, 14793L, 15274L, 15274L, 1266L, 10390L, 10390L, 3107L, 
    3107L, 14590L, 14590L, 19565L, 19565L, 25776L, 31626L, 61811L, 
    61811L, 64134L, 64134L, 5560L, 7881L, 9728L, 9728L, 12321L, 12321L, 
    20488L, 65921L, 65921L, 102341L, 14793L, 14793L, 15274L, 15274L, 
    142460L, 142460L, 142460L),
  auditor_fkey = c(2L, 2L, 2L, 2L, 
    5L, 5L, 1L, 2L, 5L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 4L, 2L, 2L, 
    2L, 2L, 5L, 5L, 1L, 1L, 1L, 2L, 5L, 5L, 580L, 580L, 7L, 5L, 597L, 
    597L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L),
  year = c(2000L, 
    2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 
    2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 
    2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
    2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
    2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
    2001L, 2001L, 2001L),
  sic = c(100L, 100L, 1000L, 1000L, 1000L, 
    1000L, 1000L, 1000L, 1000L, 1040L, 1040L, 1040L, 1040L, 1220L, 
    1220L, 1220L, 1220L, 100L, 100L, 100L, 1000L, 1000L, 1000L, 1000L, 
    1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1040L, 
    1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 
    1220L, 1220L, 1220L, 1220L, 1220L, 1220L, 1220L),
  percentage = c(100, 
    100, 37.0939127, 37.0939127, 60.6862904, 60.6862904, 2.2197969, 
    37.0939127, 60.6862904, 100, 100, 100, 100, 34.7515456, 34.7515456, 
    65.2484544, 65.2484544, 26.9605909, 73.0394091, 73.0394091, 9.889969, 
    9.889969, 18.8287898, 18.8287898, 71.1457384, 71.1457384, 71.1457384, 
    9.889969, 18.8287898, 18.8287898, 0.1355028, 0.1355028, 8.2959428, 
    71.1026743, 1.0906007, 1.0906007, 18.1873703, 18.1873703, 18.1873703, 
    18.1873703, 18.1873703, 1.3234119, 58.0859392, 58.0859392, 41.9140608, 
    41.9140608, 58.0859392, 58.0859392, 58.0859392)),
  row.names = c(NA, -49L),
  class = c("tbl_df", "tbl", "data.frame"))