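I have a large data frame in which each value in the ID column represents a person. I want to collapse the data frame so that each ID (person) occupies fewer rows (fewer duplicate IDs), but I only want to collapse rows where the values missing from one row for a given ID (ID 4, for example) are supplied by another row with the same ID. Everything should be done in R.
An example data frame is below.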
ID <- c(1, 1, 2, 4, 4, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou')
gender <- c(NA, 'female', 'male', 'male', NA, 'female')
(df <- data.frame(ID, name, gender))
  ID name gender
1  1 kate   <NA>
2  1 <NA> female
3  2  jim   male
4  4 <NA>   male
5  4  dan   <NA>
6  5  lou female
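The result would be a data frame in which missing values are collapsed by ID, so that the information in one row of a duplicated ID fills in the missing column values in another row with the same ID.
Desired result: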
  ID name gender
1  1 kate female
3  2  jim   male
4  4  dan   male
6  5  lou female
The problem is that sometimes we have a data frame like this:
ID <- c(1, 1, 2, 4, 4, 5, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith', NA)
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female', 'female')
(df2 <- data.frame(ID, name, gender))
  ID      name gender
1  1      kate   <NA>
2  1      <NA> female
3  2       jim   male
4  4      <NA>   male
5  4       dan   <NA>
6  5       lou female
7  5 lou smith female
8  5      <NA> female
I also do not want to drop rows with a duplicated ID when they carry conflicting information. In this case, the result I want is:
  ID      name gender
1  1      kate female
2  2       jim   male
4  4       dan   male
5  5       lou female
6  5 lou smith female
Answer 0 (score: 2)
library(dplyr)

ID <- c(1, 1, 2, 4, 4, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith')
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female')
(df2 <- data.frame(ID, name, gender, stringsAsFactors = FALSE))

df2 %>%
  group_by(ID) %>%
  # per-ID non-missing value (the alphabetically last one if several exist)
  mutate(name_max = max(name, na.rm = TRUE),
         gender_max = max(gender, na.rm = TRUE)) %>%
  ungroup() %>%
  # fill each NA with its group's value; non-missing entries are left alone
  mutate(name = if_else(is.na(name), name_max, name),
         gender = if_else(is.na(gender), gender_max, gender)) %>%
  select(ID, name, gender) %>%
  # rows that became identical collapse; conflicting rows survive
  distinct() %>%
  head(10)
A slight modification, which wraps the columns in as.character() so that it should also work when they are factors:
df2 %>%
  group_by(ID) %>%
  mutate(name_max = max(as.character(name), na.rm = TRUE),
         gender_max = max(as.character(gender), na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(name = if_else(is.na(name), name_max, as.character(name)),
         gender = if_else(is.na(gender), gender_max,
                          as.character(gender))) %>%
  select(ID, name, gender) %>%
  distinct()
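The same group-wise fill can be written without the helper columns. The following is only a sketch, assuming dplyr 1.0.0 or later for across(); the name collapsed is illustrative, and df2 is rebuilt here with the eighth row shown in the question so that the block runs on its own.

```r
library(dplyr)

# Example data from the question, including the 8th row (ID 5, name NA)
df2 <- data.frame(
  ID = c(1, 1, 2, 4, 4, 5, 5, 5),
  name = c("kate", NA, "jim", NA, "dan", "lou", "lou smith", NA),
  gender = c(NA, "female", "male", "male", NA, "female", "female", "female"),
  stringsAsFactors = FALSE
)

collapsed <- df2 %>%
  group_by(ID) %>%
  # within each ID, replace NA cells with the group's max() non-missing value;
  # cells that already hold a value are never overwritten
  mutate(across(c(name, gender),
                ~ if_else(is.na(.x), max(.x, na.rm = TRUE), .x))) %>%
  ungroup() %>%
  # drop rows that became exact duplicates after filling
  distinct()

print(collapsed)
```

On this data it leaves five rows: one each for IDs 1, 2 and 4, and two for ID 5, because lou and lou smith are both non-missing and therefore both kept.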
Answer 1 (score: 1)
If we replace each NA with the adjacent non-NA value within its ID and then keep the distinct rows, we can use fill from the tidyverse:
library(tidyverse)
df2 %>%
  group_by(ID) %>%
  fill(name, gender) %>%
  fill(name, gender, .direction = 'up') %>%
  distinct()
# A tibble: 5 x 3
# Groups: ID [4]
# ID name gender
# <int> <chr> <chr>
#1 1 kate female
#2 2 jim male
#3 4 dan male
#4 5 lou female
#5 5 lou smith female
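The two fill() calls can probably be merged into one. This is a sketch, assuming a tidyr version that accepts .direction = "downup" (1.0.0 or later, as far as I know); the name filled is illustrative, and the data is rebuilt inline so the block is self-contained.

```r
library(dplyr)
library(tidyr)

# Same 8-row example data as in the question
df2 <- data.frame(
  ID = c(1, 1, 2, 4, 4, 5, 5, 5),
  name = c("kate", NA, "jim", NA, "dan", "lou", "lou smith", NA),
  gender = c(NA, "female", "male", "male", NA, "female", "female", "female"),
  stringsAsFactors = FALSE
)

filled <- df2 %>%
  group_by(ID) %>%
  # fill NAs downwards first, then upwards, within each ID
  fill(name, gender, .direction = "downup") %>%
  ungroup() %>%
  # keep one copy of each resulting row
  distinct()

print(filled)
```

Note that fill() copies the nearest non-missing value, so which value an NA receives depends on row order. Here the NA name in the last ID 5 row becomes lou smith, which duplicates an existing row and is removed by distinct().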
# data used in this answer (includes the 8th row from the question)
df2 <- structure(list(ID = c(1L, 1L, 2L, 4L, 4L, 5L, 5L, 5L),
    name = c("kate", NA, "jim", NA, "dan", "lou", "lou smith", NA),
    gender = c(NA, "female", "male", "male", NA, "female", "female", "female")),
    class = "data.frame",
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))