Make R count the number of strings within an element

Time: 2018-06-28 10:38:14

Tags: r

df <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007", 
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015", 
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA, 
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA, 
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    `2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15", 
    NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
    ), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_)), .Names = c("ID", "Column1", "Column2", "Cut-Off", 
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016"), row.names = c(NA, 6L), class = "data.frame")

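Given the data frame above, I would like R to look at the cut-off year (column 4) and then create 2 new columns at the end of the data frame: one containing the total number of unique "identifiers" found within the elements for all years before the cut-off year, and the other containing the total for the years after the cut-off year. Identifiers in the cut-off year's own column should not be included.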

The data frame below shows the desired output.

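For example, in the first row the cut-off year is 2011, and the years 2007, 2008 and 2009 before the cut-off have the identifiers 15, 16 and 15 respectively. The unique identifiers are therefore 15 and 16 (the second 15 is dropped), so the count in the "Before" column is 2. After the cut-off year, only 2013 has an identifier, so the count in the "After" column is 1.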

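If an element contains 2 or more identifiers (for example "30, 27" or "30, 18" in rows 4 and 5), they should still be treated as separate identifiers delimited by commas.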
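
To make the counting rule concrete, here is a minimal base R sketch of it for a single row. The helper name `count_unique_ids` and the hand-typed vector `row1` are assumptions introduced only for illustration; they are not part of the original data or of any package.

```r
# Hypothetical helper: count the distinct identifiers in a character vector
# whose elements may be NA or may hold several comma-separated identifiers.
count_unique_ids <- function(cells) {
  cells <- as.character(cells[!is.na(cells)])              # drop years with no identifier
  ids <- as.character(unlist(strsplit(cells, ",", fixed = TRUE)))  # "30, 27" -> "30", " 27"
  length(unique(trimws(ids)))                              # trim spaces, count distinct ids
}

# Row 1 of the example, typed by hand for the year columns 2005 to 2016
years  <- 2005:2016
row1   <- c(NA, NA, "15", "16", "15", NA, NA, NA, "15", NA, NA, NA)
cutoff <- 2011

before <- count_unique_ids(row1[years < cutoff])  # identifiers 15, 16, 15
after  <- count_unique_ids(row1[years > cutoff])  # identifier 15 (from 2013)
```

Under these assumptions `before` is 2 and `after` is 1, matching the first row of the desired output. The strict comparisons `<` and `>` are what keep the cut-off year's own column out of both counts.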

df_solution <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), Column1 = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Column2 = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007", 
"2014, 2011", "2007"), `Cut-Off` = c("2011", "2015", "2015", 
"2005", "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA
), `2006` = c(NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_), `2007` = c("15", NA, "18", NA, 
"30, 18", NA), `2008` = c("16", NA, NA, "30, 27", "18, 30", NA
), `2009` = c("15", NA, NA, "20", "30, 18", NA), `2010` = c(NA, 
NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    `2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15", 
    NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA
    ), `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), Before = c(2, 0, 2, 0, 2, 0), After = c(1, 
    0, 0, 3, 1, 1)), .Names = c("ID", "Column1", "Column2", "Cut-Off", 
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016", "Before", "After"), row.names = c(NA, 
6L), class = "data.frame")

1 answer:

Answer 0 (score: 2)

library(tidyverse)

df %>% 
  select(-Column1, - Column2) %>%         # remove those columns
  gather(year,value,-ID, -`Cut-Off`) %>%  # reshape data
  na.omit() %>%                           # remove rows with NA
  separate_rows(value) %>%                # split values (using commas)
  group_by(ID, `Cut-Off`) %>%             # for each ID and cut-off
  summarise(Before = n_distinct(value[as.numeric(`Cut-Off`) > as.numeric(year)]),     # count distinct values where cut-off is after the dates
            After = n_distinct(value[as.numeric(`Cut-Off`) < as.numeric(year)])) %>%  # count distinct values where cut-off is before the dates
  ungroup()  %>%                     # forget the grouping
  select(-`Cut-Off`) %>%             # remove cut-off column
  right_join(df, by="ID") %>%        # join back original dataset
  mutate_at(vars(Before,After), ~coalesce(.,0L))  # replace NAs with 0 in those two columns


# # A tibble: 6 x 18
# ID    Before After Column1 Column2      `Cut-Off` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
#   <chr>  <int> <int> <chr>   <chr>        <chr>     <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
# 1 1          2     1 NA      2011         2011      NA     NA     15     16     15     NA     NA     NA    
# 2 2          0     0 NA      2015         2015      NA     NA     NA     NA     NA     NA     NA     NA    
# 3 3          2     0 NA      2015         2015      NA     NA     18     NA     NA     NA     NA     NA    
# 4 4          0     3 NA      2006, 2006,~ 2005      30     NA     NA     30, 27 20     30, 20 NA     20, 30
# 5 5          2     1 NA      2014, 2011   2011      18     NA     30, 18 18, 30 30, 18 NA     NA     NA    
# 6 6          0     1 NA      2007         2007      NA     NA     NA     NA     NA     NA     NA     26    
# # ... with 4 more variables: `2013` <chr>, `2014` <chr>, `2015` <chr>, `2016` <chr>
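
For readers who prefer not to load the tidyverse, the same idea can be sketched in base R. This is not the answerer's method, only an alternative under stated assumptions: `toy` is a hand-made subset of the question's rows and year columns, and `count_ids` is a hypothetical helper.

```r
# Toy data with the same layout as df (a subset of rows and years, for brevity)
toy <- data.frame(
  ID        = c("1", "2", "4"),
  `Cut-Off` = c("2011", "2015", "2005"),
  `2005`    = c(NA, NA, "30"),
  `2007`    = c("15", NA, NA),
  `2008`    = c("16", NA, "30, 27"),
  `2009`    = c("15", NA, "20"),
  `2013`    = c("15", NA, NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

year_cols <- grep("^[0-9]{4}$", names(toy), value = TRUE)  # columns named like a year
years     <- as.numeric(year_cols)
cutoffs   <- as.numeric(toy$`Cut-Off`)

# Hypothetical helper: distinct comma-separated identifiers in a set of cells
count_ids <- function(cells) {
  cells <- as.character(cells[!is.na(cells)])
  ids <- as.character(unlist(strsplit(cells, ",", fixed = TRUE)))
  length(unique(trimws(ids)))
}

# For each row, pool the cells strictly before / after that row's cut-off year
toy$Before <- vapply(seq_len(nrow(toy)), function(i) {
  count_ids(unlist(toy[i, year_cols[years < cutoffs[i]], drop = FALSE]))
}, integer(1))
toy$After <- vapply(seq_len(nrow(toy)), function(i) {
  count_ids(unlist(toy[i, year_cols[years > cutoffs[i]], drop = FALSE]))
}, integer(1))
```

The row-wise `vapply` is simpler to read than the reshape-and-join pipeline, but it will likely be slower on large data frames, where the long-format approach above is the more natural fit.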