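tidyr / dplyr - spreading multiple variables for duplicate ids

Time: 2017-02-13 16:22:11

Tags: r dplyr tidyr

I have some very dirty data that I am really struggling to clean. An example of the problem looks like this: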

ID    NAME    ADDRESS               EMAIL     PHN
1   Alice 123 Street     alice@gmail.com 5555555
1   Alice 123 Street                <NA> 4444444
2     Bob   9 Circle       Bob@gmail.com 1111111
3 Charlie      4 Ave   Charlie@gmail.com 3333333
3 Charlie      4 Ave Charlie@hotmail.com 3333333
3 Charlie      4 Ave                <NA>      NA
4    Doug    1 Court                <NA> 6666666

The desired output looks like this:

ID    NAME    ADDRESS           EMAIL_1             EMAIL_2   PHN_1   PHN_2
1   Alice 123 Street   alice@gmail.com                <NA> 5555555 4444444
2     Bob   9 Circle     bob@gmail.com                <NA> 1111111      NA
3 Charlie      4 Ave charlie@gmail.com charlie@hotmail.com 3333333      NA
4    Doug    1 Court              <NA>                <NA> 6666666      NA

This is with the understanding that the EMAIL and PHN variables may need to expand arbitrarily (i.e., there could be n duplicate IDs with different (or NA) values).

My solution so far:

df.test <- df %>%
  group_by(ID) %>%
  mutate(EMAILID = paste0("EMAIL_",row_number())) %>%
  spread(EMAILID,EMAIL) %>%
  mutate(PHONEID = paste0('PHN_',row_number())) %>%
  spread(PHONEID,PHN)

But this produces an even more incorrect data.frame:

ID    NAME    ADDRESS           EMAIL_1             EMAIL_2 EMAIL_3   PHN_1   PHN_2 PHN_3
1   Alice 123 Street   alice@gmail.com                <NA>    <NA> 5555555      NA    NA
1   Alice 123 Street              <NA>                <NA>    <NA>      NA 4444444    NA
2     Bob   9 Circle     Bob@gmail.com                <NA>    <NA> 1111111      NA    NA
3 Charlie      4 Ave Charlie@gmail.com                <NA>    <NA> 3333333      NA    NA
3 Charlie      4 Ave              <NA> Charlie@hotmail.com    <NA>      NA 3333333    NA
3 Charlie      4 Ave              <NA>                <NA>    <NA>      NA      NA    NA
4    Doug    1 Court              <NA>                <NA>    <NA> 6666666      NA    NA

Any help? I suspect my problem is related to the [...] command, but my attempts so far have proven fruitless. Thanks.

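3 Answers:

Answer 0: (score: 3)

You need summarize rather than mutate, followed by separate to split the result. To do this dynamically, determine in advance the number of distinct email and phone columns to create, use separate_, and set fill = "right" to suppress the warnings. The final two mutate statements clean up NA values that were turned into strings.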

library(dplyr)
library(tidyr)

# maximum number of rows per ID that have both EMAIL and PHN present
cols <- df %>% 
  group_by(ID) %>% 
  filter(!is.na(PHN), !is.na(EMAIL)) %>% 
  group_size() %>% 
  max()

df %>%
  group_by(ID, NAME, ADDRESS) %>%
  summarize_each(funs(toString(unique(.[!is.na(.)]))), EMAIL, PHN) %>% 
  separate_("EMAIL", sprintf("EMAIL%s", 1:cols), sep = ",", fill = "right") %>% 
  separate_("PHN", sprintf("PHN%s", 1:cols), sep = ",", fill = "right") %>% 
  mutate_if(is.character, trimws) %>% 
  mutate_each(funs(replace(., grep("NA", .), NA)))

Source: local data frame [4 x 7]
Groups: ID, NAME [4]

     ID    NAME    ADDRESS            EMAIL1              EMAIL2    PHN1    PHN2
  <int>  <fctr>     <fctr>             <chr>               <chr>   <chr>   <chr>
1     1   Alice 123 Street   alice@gmail.com                <NA> 5555555 4444444
2     2     Bob   9 Circle     Bob@gmail.com                <NA> 1111111    <NA>
3     3 Charlie      4 Ave Charlie@gmail.com Charlie@hotmail.com 3333333    <NA>
4     4    Doug    1 Court              <NA>                <NA> 6666666    <NA>

Warnings will be thrown.
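To make the collapse-then-split idea in this answer concrete, here is a minimal base R sketch of what happens to a single ID's values. The vector emails is a hypothetical stand-in for one group's EMAIL column, and strsplit is used here in place of separate_ purely for illustration.

```r
# Hypothetical toy vector standing in for one ID's EMAIL values
emails <- c("Charlie@gmail.com", "Charlie@hotmail.com", NA)

# Drop NAs, de-duplicate, and collapse into one comma-separated string
collapsed <- toString(unique(emails[!is.na(emails)]))

# toString() joins with ", ", so splitting on "," leaves a leading space;
# trimws() removes it, which is what mutate_if(is.character, trimws) does above
pieces <- trimws(strsplit(collapsed, ",")[[1]])

print(collapsed)
print(pieces)
```

Because unique() is applied before collapsing, a value repeated within an ID (such as Charlie's phone number) appears only once in the result.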

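Answer 1: (score: 0)

1) reshape Using base R this can be done in 3 lines. The first line of code adds a sequence number for each ID, the second line reshapes the data frame from long to wide, and the last line removes columns that contain only NA. (If all-NA columns are unlikely, or you don't mind them, the third line of code can be omitted.)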

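# 1. number the rows within each ID
df2 <- transform(df.test, seq = ave(ID, ID, FUN = seq_along))
# 2. reshape from long to wide, one set of columns per sequence number
df2 <- reshape(df2, direction = "wide", timevar = "seq", idvar = c("ID", "NAME", "ADDRESS"))
# 3. drop columns that contain only NA
subset(df2, select = !apply(is.na(df2), 2, all))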

giving:

  ID    NAME    ADDRESS           EMAIL.1   PHN.1             EMAIL.2   PHN.2
1  1   Alice 123 Street   alice@gmail.com 5555555                <NA> 4444444
3  2     Bob   9 Circle     Bob@gmail.com 1111111                <NA>      NA
4  3 Charlie      4 Ave Charlie@gmail.com 3333333 Charlie@hotmail.com 3333333
7  4    Doug    1 Court              <NA> 6666666                <NA>      NA
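The sequence-numbering step is the part of this approach that is easiest to misread, so here is a small sketch of it in isolation. The vector ids is a hypothetical copy of the ID column from the example data.

```r
# Hypothetical copy of the ID column from the example data
ids <- c(1, 1, 2, 3, 3, 3, 4)

# ave() splits ids by the grouping vector (here, ids itself), applies
# seq_along() to each group, and puts the results back in the original order
seq_in_group <- ave(ids, ids, FUN = seq_along)

print(seq_in_group)
```

Each row receives its position within its ID, and reshape then uses that position as the timevar to decide which wide column (.1, .2, ...) a value lands in.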

2) magrittr Alternatively, the same code can be written as a magrittr pipeline:

library(magrittr)

df.test %>%
   transform(seq = ave(ID, ID, FUN = seq_along)) %>%
   reshape(dir = "wide", timevar = "seq", idvar = c("ID", "NAME", "ADDRESS")) %>%
   subset(select = !apply(is.na(.), 2, all))

Note: The input df.test in reproducible form is:

Lines <- "
ID,NAME,ADDRESS,EMAIL,PHN
1,Alice,123 Street,alice@gmail.com,5555555
1,Alice,123 Street,NA,4444444
2,Bob,9 Circle,Bob@gmail.com,1111111
3,Charlie,4 Ave,Charlie@gmail.com,3333333
3,Charlie,4 Ave,Charlie@hotmail.com,3333333
3,Charlie,4 Ave,NA,
4,Doug,1 Court,NA,6666666"
df.test <- read.csv(text=Lines)

Answer 2: (score: 0)

For anyone concerned about the summarize_each deprecation warning, the following code uses the currently supported functions:

# cols is the maximum number of values per ID, computed as in answer 0
df.test %>% 
  group_by(ID, NAME, ADDRESS) %>%
  summarize_at(vars(EMAIL, PHN), funs(toString(unique(.[!is.na(.)])))) %>%
  separate(EMAIL, sprintf('EMAIL%s', 1:cols), sep = ",", fill = 'right') %>%
  separate(PHN, sprintf('PHN%s', 1:cols), sep = ",", fill = 'right') %>%
  mutate_if(is.character, trimws) %>%
  mutate_all(funs(replace(., grep("NA", .), NA)))
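Since funs() has itself been deprecated in later dplyr releases, readers on a recent tidyverse may prefer a different route. The sketch below is not from any of the answers above: it assumes tidyr 1.0.0 or later (for pivot_wider) and dplyr 1.0.0 or later (for where inside select), and it borrows the per-ID sequence number idea from Answer 1. The column name seq is only an illustrative helper.

```r
library(dplyr)
library(tidyr)

# Same reproducible input as in Answer 1
df.test <- read.csv(text = "ID,NAME,ADDRESS,EMAIL,PHN
1,Alice,123 Street,alice@gmail.com,5555555
1,Alice,123 Street,NA,4444444
2,Bob,9 Circle,Bob@gmail.com,1111111
3,Charlie,4 Ave,Charlie@gmail.com,3333333
3,Charlie,4 Ave,Charlie@hotmail.com,3333333
3,Charlie,4 Ave,NA,
4,Doug,1 Court,NA,6666666")

wide <- df.test %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>%   # position of each row within its ID
  ungroup() %>%
  # spread EMAIL and PHN at once; columns are named EMAIL_1, PHN_1, ...
  pivot_wider(names_from = seq, values_from = c(EMAIL, PHN)) %>%
  # drop columns that contain only NA (here EMAIL_3 and PHN_3)
  select(where(~ !all(is.na(.x))))

print(wide)
```

Like the reshape answer, this keeps repeated values as they are (Charlie's phone number appears in both PHN_1 and PHN_2), so it does not reproduce the de-duplication that the summarize-based answers perform.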