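How many categories are in a column within the same group in R?

Asked: 2018-05-08 13:39:20

Tags: r dataframe group-by categories

I have a data frame (df) that lists the directors (DirectorID) of two different companies (CompanyID) for two years (2006 and 2007), together with each director's gender (M or F).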

df <-
CompanyID   Name    Country ISIN     Director_2006  Gender_2006 Director_2007 Gender_2007   
25830      BANKxxx     Austria  AT000504  11734844255     M     11734844255      M       
25830      BANKxxx     Austria  AT000504  187836811559    F      5524344997      F        
25830      BANKxxx     Austria  AT000504    5524344997    F      5524354997      M        
25830      BANKxxx     Austria  AT000504    5524354997    M      5742347684      M        
25830      BANKxxx     Austria  AT000504    6613115791    M      40160443378     M          
12339      BANKyyy     Belgium  AT034003    5524344997    M      5524344997      M        
12339      BANKyyy     Belgium  AT034003    5524354997    M      5524354997      M        

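I would like to add five more columns after each gender column, i.e. "Gender_2006" and "Gender_2007", with the following information:

  • Column 1: the number of women in that company in that year
  • Column 2: the number of men in that company in that year
  • Column 3: 1 if there is at least one woman in that company in that year, 0 otherwise
  • Column 4: the percentage of women (F) in that company in that year
  • Column 5: the Blau index

df_final is my expected final output.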

df_final <-
CompanyID  Name  Country  ISIN   Director_2006 Gender_2006 F2006 M2006 Findex2006 Fperce2006  Blauindex2006  Director_2007  Gender_2007  F2007  M2007  Findex2007 Fperce2007  Blauindex2007     
25830    BANKxxx Austria AT000504 11734844255     M        2       3       1           0.4     0.25           11734844255        M         1     4       1         0.25           0.07      
25830    BANKxxx Austria AT000504 187836811559    F        NA      NA     NA            NA     NA              5524344997        F         NA    NA      NA        NA             NA           
25830    BANKxxx Austria AT000504 5524344997      F        NA      NA     NA            NA     NA              5524354997        M         NA    NA      NA        NA             NA
25830    BANKxxx Austria AT000504 5524354997      M        NA      NA     NA            NA     NA              5742347684      M           NA    NA      NA        NA             NA
25830    BANKxxx Austria AT000504 6613115791      M        NA      NA     NA            NA     NA              40160443378     M           NA    NA      NA        NA             NA
12339    BANKyyy Belgium AT034003 5524344997      M        0       2      0             0      0               5524344997      M           0     2       0         0              0
12339    BANKyyy Belgium AT034003 5524354997      M        NA      NA     NA            NA     NA              5524354997      M           NA    NA      NA        NA             NA

Could someone please tell me how to do this? Thanks.

My data

df <- read.table(text = 
               "CompanyID   Name    Country ISIN     Director_2006  Gender_2006 Director_2007 Gender_2007  
                25830      BANKxxx     Austria  AT000504  11734844255     M     11734844255      M        
                25830      BANKxxx     Austria  AT000504  187836811559    F      5524344997      F       
                25830      BANKxxx     Austria  AT000504    5524344997    F      5524354997      M       
                25830      BANKxxx     Austria  AT000504    5524354997    M      5742347684      M       
                25830      BANKxxx     Austria  AT000504    6613115791    M      40160443378     M         
                12339      BANKyyy     Belgium  AT034003    5524344997    M      5524344997      M       
                12339      BANKyyy     Belgium  AT034003    5524354997    M      5524354997      M",
                header = TRUE, stringsAsFactors = FALSE)

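1 answer:

Answer 0: (score: 1)

In the dplyr code below, group_by specifies what you are grouping by, in this case CompanyID. mutate creates new columns (not rows) from the expressions you give it, evaluated within each group. select only changes the column order.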

library(dplyr)
df  %>% group_by(CompanyID) %>%    # everything below is computed per company
    mutate(F2006 = sum(Gender_2006 == "F", na.rm = TRUE),    # number of women
            M2006 = sum(Gender_2006 == "M", na.rm = TRUE),    # number of men
            Findex2006 = as.integer(sum(Gender_2006 == "F", na.rm = TRUE)>0),    # 1 if any woman
            Fperce2006 = F2006/(F2006+M2006),    # share of women
            F2007 = sum(Gender_2007 == "F", na.rm = TRUE),
            M2007 = sum(Gender_2007 == "M", na.rm = TRUE),
            Findex2007 = as.integer(sum(Gender_2007 == "F", na.rm = TRUE)>0),
            Fperce2007 = F2007/(F2007+M2007)) %>% 
    # reorder: identifier columns first, then the 2006 columns, then the 2007 columns
    select(-matches("2006|2007"),matches("2006"), matches("2007"))



# A tibble: 8 x 16
# Groups: CompanyID [2]
#   CompanyID Name    Country ISIN     Director_2006 Gender_2006 F2006 M2006 Findex2006 Fperce2006 Director_2007 Gender_2007
#       <int> <fct>   <fct>   <fct>            <dbl> <fct>       <int> <int>      <int>      <dbl>         <dbl> <fct>      
# 1     25830 BANKxxx Austria AT000504   11734844255 M               2     3          1      0.400   11734844255 M          
# 2     25830 BANKxxx Austria AT000504  187836811559 F               2     3          1      0.400    5524344997 F          
# 3     25830 BANKxxx Austria AT000504    5524344997 F               2     3          1      0.400    5524354997 M          
# 4     25830 BANKxxx Austria AT000504    5524354997 M               2     3          1      0.400    5742347684 M          
# 5     25830 BANKxxx Austria AT000504    6613115791 M               2     3          1      0.400   40160443378 M          
# 6     12339 BANKyyy Belgium AT034003    5524344997 M               0     2          0      0        5524344997 M          
# 7     12339 BANKyyy Belgium AT034003    5524354997 M               0     2          0      0        5524354997 M          
# 8     12339 BANKyyy Belgium AT034003            NA <NA>            0     2          0      0                NA <NA> 
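
The code above does not compute the fifth requested column, the Blau index. Here is a minimal sketch, assuming the usual definition 1 - sum(p_i^2), where p_i is the share of each gender within the company for that year. The helper name blau_index and the toy data frame are made up for illustration, and base R is used so that it runs without extra packages.

```r
# Blau index of a categorical vector: 1 - sum of squared category shares.
# Assumption: standard definition; NA values are dropped before counting.
blau_index <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  p <- table(x) / length(x)   # share of each category
  1 - sum(p^2)
}

# Toy data mirroring the CompanyID and Gender_2006 columns of the question
toy <- data.frame(
  CompanyID   = c(25830, 25830, 25830, 25830, 25830, 12339, 12339),
  Gender_2006 = c("M", "F", "F", "M", "M", "M", "M"),
  stringsAsFactors = FALSE
)

# One value per company, then looked up again so every row carries its company's value
blau_by_company <- tapply(toy$Gender_2006, toy$CompanyID, blau_index)
toy$Blauindex2006 <- as.numeric(blau_by_company[as.character(toy$CompanyID)])
```

Inside the grouped mutate it should be usable directly as Blauindex2006 = blau_index(Gender_2006). Note that with 2 F and 3 M this definition gives 1 - (0.4^2 + 0.6^2) = 0.48, which differs from the 0.25 in the question's expected output, so the asker may have a different variant of the index in mind.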

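If you want NA in every row except the first row of each company, you can change the expressions inside mutate along these lines (row_number() restarts at 1 within each group):

F2006 = ifelse(row_number()==1,sum(Gender_2006 == "F", na.rm = TRUE),NA),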
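
If wrapping every expression in ifelse gets repetitive, another option is to compute the columns on every row first and blank out the repeats afterwards. This is a base-R sketch of that alternative, not the approach shown above; the data frame res and the vector summary_cols are hypothetical stand-ins for the result of the grouped mutate and its new columns.

```r
# Toy stand-in for the result of the grouped mutate (values repeated on every row)
res <- data.frame(
  CompanyID = c(25830, 25830, 25830, 25830, 25830, 12339, 12339),
  F2006     = c(2L, 2L, 2L, 2L, 2L, 0L, 0L),
  M2006     = c(3L, 3L, 3L, 3L, 3L, 2L, 2L)
)

# Hypothetical list of the summary columns that should be blanked out
summary_cols <- c("F2006", "M2006")

# duplicated() is FALSE for the first row of each CompanyID and TRUE for later rows
repeat_rows <- duplicated(res$CompanyID)

# Keep the values on the first row of each company, set the rest to NA
res[repeat_rows, summary_cols] <- NA
```

This relies only on the first occurrence of each CompanyID, so the rows do not need to be sorted by company.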