I have some very dirty data that I am really struggling to clean. An example of the problem is as follows:
ID NAME    ADDRESS    EMAIL               PHN
1  Alice   123 Street alice@gmail.com     5555555
1  Alice   123 Street <NA>                4444444
2  Bob     9 Circle   Bob@gmail.com       1111111
3  Charlie 4 Ave      Charlie@gmail.com   3333333
3  Charlie 4 Ave      Charlie@hotmail.com 3333333
3  Charlie 4 Ave      <NA>                NA
4  Doug    1 Court    <NA>                6666666
The desired output is something like this:
ID NAME    ADDRESS    EMAIL_1           EMAIL_2             PHN_1   PHN_2
1  Alice   123 Street alice@gmail.com   <NA>                5555555 4444444
2  Bob     9 Circle   bob@gmail.com     <NA>                1111111 NA
3  Charlie 4 Ave      charlie@gmail.com charlie@hotmail.com 3333333 NA
4  Doug    1 Court    <NA>              <NA>                6666666 NA
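This is with the understanding that the EMAIL and PHN variables can be extended arbitrarily (i.e., there may be n repeated IDs with different (or NA) values).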
My solution so far:
df.test <- df %>%
  group_by(ID) %>%
  mutate(EMAILID = paste0("EMAIL_",row_number())) %>%
  spread(EMAILID,EMAIL) %>%
  mutate(PHONEID = paste0('PHN_',row_number())) %>%
  spread(PHONEID,PHN)
But this produces an even more incorrect data.frame:
ID NAME    ADDRESS    EMAIL_1           EMAIL_2             EMAIL_3 PHN_1   PHN_2   PHN_3
1  Alice   123 Street alice@gmail.com   <NA>                <NA>    5555555 NA      NA
1  Alice   123 Street <NA>              <NA>                <NA>    NA      4444444 NA
2  Bob     9 Circle   Bob@gmail.com     <NA>                <NA>    1111111 NA      NA
3  Charlie 4 Ave      Charlie@gmail.com <NA>                <NA>    3333333 NA      NA
3  Charlie 4 Ave      <NA>              Charlie@hotmail.com <NA>    NA      3333333 NA
3  Charlie 4 Ave      <NA>              <NA>                <NA>    NA      NA      NA
4  Doug    1 Court    <NA>              <NA>                <NA>    6666666 NA      NA
Any help? I suspect my problem is related to one of the commands above, but so far my attempts have proven futile. Thanks.
Answer 0 (score: 3)
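You need summarize rather than mutate, and then separate to split the result. To do this dynamically, determine ahead of time how many distinct email and phone groups you need, use separate_, and set fill = "right" to suppress the warnings. The last two mutate statements clean up the NA values that were turned into strings.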
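library(dplyr)
library(tidyr)

# Number of pieces to split into: the largest count of complete rows for any ID
cols <- df %>%
  group_by(ID) %>%
  filter(!is.na(PHN), !is.na(EMAIL)) %>%
  group_size() %>%
  max()

df %>%
  group_by(ID, NAME, ADDRESS) %>%
  # Collapse the unique non-NA values of each column into one comma-separated string
  summarize_each(funs(toString(unique(.[!is.na(.)]))), EMAIL, PHN) %>%
  # Split each string into EMAIL1..EMAILn and PHN1..PHNn, padding with NA on the right
  separate_("EMAIL", sprintf("EMAIL%s", 1:cols), sep = ",", fill = "right") %>%
  separate_("PHN", sprintf("PHN%s", 1:cols), sep = ",", fill = "right") %>%
  # Trim the space left after each comma, then turn "NA" strings back into real NA
  mutate_if(is.character, trimws) %>%
  mutate_each(funs(replace(., grep("NA", .), NA)))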
Source: local data frame [4 x 7]
Groups: ID, NAME [4]
  ID    NAME    ADDRESS    EMAIL1            EMAIL2              PHN1    PHN2
  <int> <fctr>  <fctr>     <chr>             <chr>               <chr>   <chr>
1 1     Alice   123 Street alice@gmail.com   <NA>                5555555 4444444
2 2     Bob     9 Circle   Bob@gmail.com     <NA>                1111111 <NA>
3 3     Charlie 4 Ave      Charlie@gmail.com Charlie@hotmail.com 3333333 <NA>
4 4     Doug    1 Court    <NA>              <NA>                6666666 <NA>
Warnings will be thrown.
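One caveat about the count used above: it keeps only rows where both EMAIL and PHN are present, so it can undercount when an ID has more phone numbers than complete rows. In the sample data Alice has two phone numbers but only one complete row, and the result is still 2 only because of Charlie. Below is a minimal base R sketch of an alternative that counts distinct non-missing values per ID in each column. The helper name max_distinct_per_id is made up for illustration, and the data is the sample from the question.

```r
# Sample data from the question; "NA" and empty fields are read as NA.
df.test <- read.csv(text = "ID,NAME,ADDRESS,EMAIL,PHN
1,Alice,123 Street,alice@gmail.com,5555555
1,Alice,123 Street,NA,4444444
2,Bob,9 Circle,Bob@gmail.com,1111111
3,Charlie,4 Ave,Charlie@gmail.com,3333333
3,Charlie,4 Ave,Charlie@hotmail.com,3333333
3,Charlie,4 Ave,NA,
4,Doug,1 Court,NA,6666666", stringsAsFactors = FALSE)

# Hypothetical helper: the largest number of distinct non-NA values
# that any single ID has in the vector x.
max_distinct_per_id <- function(x, id) {
  max(tapply(x, id, function(v) length(unique(v[!is.na(v)]))))
}

n_email <- max_distinct_per_id(df.test$EMAIL, df.test$ID)
n_phn   <- max_distinct_per_id(df.test$PHN, df.test$ID)

# If both columns are split into the same number of pieces, take the larger.
cols <- max(n_email, n_phn)
```

Computing the two counts separately also makes it possible to split EMAIL and PHN into different numbers of columns, if that is preferred.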
Answer 1 (score: 0)
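1) reshape. With base R this can be done in 3 lines. The first line of code adds a sequence number within each ID, the second line reshapes the data frame from long to wide, and the third line removes the columns that contain only NA. (If all-NA columns are unlikely, or you do not mind them, the third line can be omitted.)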
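# 1. Number the rows within each ID
df2 <- transform(df.test, seq = ave(ID, ID, FUN = seq_along))
# 2. Reshape from long to wide, one set of columns per sequence number
df2 <- reshape(df2, dir = "wide", timevar = "seq", idvar = c("ID", "NAME", "ADDRESS"))
# 3. Drop the columns that contain only NA
subset(df2, select = !apply(is.na(df2), 2, all))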
giving:
  ID NAME    ADDRESS    EMAIL.1           PHN.1   EMAIL.2             PHN.2
1 1  Alice   123 Street alice@gmail.com   5555555 <NA>                4444444
3 2  Bob     9 Circle   Bob@gmail.com     1111111 <NA>                NA
4 3  Charlie 4 Ave      Charlie@gmail.com 3333333 Charlie@hotmail.com 3333333
7 4  Doug    1 Court    <NA>              6666666 <NA>                NA
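To make the first step concrete, here is a small sketch of what ave with FUN = seq_along produces: a running counter that restarts at every new ID. The ID vector mirrors the sample data, and the variable name seq_in_group is only illustrative.

```r
# IDs in the same order as the sample data
ID <- c(1, 1, 2, 3, 3, 3, 4)

# ave() applies seq_along within each group of equal IDs and
# returns the results in the original row order.
seq_in_group <- ave(ID, ID, FUN = seq_along)

print(seq_in_group)
# [1] 1 2 1 1 2 3 1
```

This counter is what reshape uses as timevar, so the second row of an ID ends up in the .2 columns, the third in the .3 columns, and so on.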
2) magrittr. Alternatively, the same code can be written as a magrittr pipeline:
library(magrittr)
df.test %>%
  transform(seq = ave(ID, ID, FUN = seq_along)) %>%
  reshape(dir = "wide", timevar = "seq", idvar = c("ID", "NAME", "ADDRESS")) %>%
  subset(select = !apply(is.na(.), 2, all))
Note: The input df.test in reproducible form is:
Lines <- "
ID,NAME,ADDRESS,EMAIL,PHN
1,Alice,123 Street,alice@gmail.com,5555555
1,Alice,123 Street,NA,4444444
2,Bob,9 Circle,Bob@gmail.com,1111111
3,Charlie,4 Ave,Charlie@gmail.com,3333333
3,Charlie,4 Ave,Charlie@hotmail.com,3333333
3,Charlie,4 Ave,NA,
4,Doug,1 Court,NA,6666666"
df.test <- read.csv(text=Lines)
Answer 2 (score: 0)
For anyone concerned about the summarize_each deprecation warning, the following code works with currently supported functions:
# cols is the column count computed in Answer 0
df.test %>%
  group_by(ID, NAME, ADDRESS) %>%
  summarize_at(vars(EMAIL, PHN), funs(toString(unique(.[!is.na(.)])))) %>%
  separate(EMAIL, sprintf('EMAIL%s', 1:cols), sep = ",", fill = 'right') %>%
  separate(PHN, sprintf('PHN%s', 1:cols), sep = ",", fill = 'right') %>%
  mutate_if(is.character, trimws) %>%
  mutate_all(funs(replace(., grep("NA", .), NA)))
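Since this answer was written, funs() has itself been deprecated in later dplyr releases in favor of across() with lambda-style functions. Below is a sketch of the same pipeline under that assumption (dplyr 1.0 or later, with tidyr installed). It recomputes cols with n_distinct instead of reusing the value from Answer 0, and it replaces the grep-based cleanup with na_if on empty strings, so treat it as a variation rather than the original author's code.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; "NA" and empty fields are read as NA.
df.test <- read.csv(text = "ID,NAME,ADDRESS,EMAIL,PHN
1,Alice,123 Street,alice@gmail.com,5555555
1,Alice,123 Street,NA,4444444
2,Bob,9 Circle,Bob@gmail.com,1111111
3,Charlie,4 Ave,Charlie@gmail.com,3333333
3,Charlie,4 Ave,Charlie@hotmail.com,3333333
3,Charlie,4 Ave,NA,
4,Doug,1 Court,NA,6666666", stringsAsFactors = FALSE)

# Assumption: size the split by the most distinct non-NA values
# that any ID has in either column.
cols <- df.test %>%
  group_by(ID) %>%
  summarise(n = max(n_distinct(EMAIL, na.rm = TRUE),
                    n_distinct(PHN, na.rm = TRUE))) %>%
  pull(n) %>%
  max()

wide <- df.test %>%
  group_by(ID, NAME, ADDRESS) %>%
  # One comma-separated string of unique non-NA values per column
  summarise(across(c(EMAIL, PHN), ~ toString(unique(.x[!is.na(.x)]))),
            .groups = "drop") %>%
  # toString() joins with ", ", so split on that exact separator
  separate(EMAIL, paste0("EMAIL", seq_len(cols)), sep = ", ", fill = "right") %>%
  separate(PHN, paste0("PHN", seq_len(cols)), sep = ", ", fill = "right") %>%
  # An ID with no values yields "", which is turned back into NA here
  mutate(across(where(is.character), ~ na_if(.x, "")))

print(wide)
```

Splitting on ", " removes the need for the trimws step, and matching empty strings instead of the pattern "NA" avoids blanking out legitimate values that happen to contain those two letters.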