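How can I collapse a data frame that has duplicate IDs with different missing values per ID, so that each NA is replaced by the value from another row with the same ID? (in R)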

Time: 2019-01-10 16:19:38

Tags: r dataframe na collapse

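I have a large data frame in which each value in the ID column represents a person. I want to collapse the data frame so that each ID (person) takes up fewer rows (fewer duplicate IDs). However, I only want to collapse rows where a value missing from one row (for example, a row with ID 4) is supplied by another row with the same ID. All of this should be done in R.

Example data frame below.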

ID <- c(1, 1, 2, 4, 4, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou')
gender <- c(NA, 'female', 'male', 'male', NA, 'female')

(df <- data.frame(ID, name, gender))

  ID name gender
1  1 kate   <NA>
2  1 <NA> female
3  2  jim   male
4  4 <NA>   male
5  4  dan   <NA>
6  5  lou female

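The result should be a data frame in which missing values are collapsed by ID: information from one row of a duplicated ID fills in the missing column values of the other rows with that ID.

Desired result: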

  ID name gender
1  1 kate female
3  2  jim   male
4  4  dan   male
6  5  lou female

The problem is that sometimes we have a data frame like this:

ID <- c(1, 1, 2, 4, 4, 5, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith', NA)
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female', 'female')
(df2 <- data.frame(ID, name, gender))

  ID      name gender
1  1      kate   <NA>
2  1      <NA> female
3  2       jim   male
4  4      <NA>   male
5  4       dan   <NA>
6  5       lou female
7  5 lou smith female
8  5      <NA> female

If the rows for a duplicated ID contain conflicting information, I do not want to drop those rows. In this case, I want the result to be:

  ID      name gender
1  1      kate female
2  2       jim   male
4  4       dan   male
5  5       lou female
6  5 lou smith female

2 answers:

Answer 0: (score: 2)

library(dplyr)

ID <- c(1, 1, 2, 4, 4, 5, 5)
name <- c('kate', NA, 'jim', NA, 'dan', 'lou', 'lou smith')
gender <- c(NA, 'female', 'male', 'male', NA, 'female', 'female')
(df2 <- data.frame(ID, name, gender, stringsAsFactors = FALSE))


df2

df2 %>%
  group_by(ID) %>%
  # per-ID fallback: the largest non-missing string within each group
  mutate(name_max = max(name, na.rm = TRUE),
         gender_max = max(gender, na.rm = TRUE)) %>%
  ungroup() %>%
  # replace each NA with its group's fallback value
  mutate(name   = if_else(is.na(name), name_max, name),
         gender = if_else(is.na(gender), gender_max, gender)) %>%
  select(ID, name, gender) %>%
  distinct() %>%
  head(10)

Slightly modified:

df2 %>%
  group_by(ID) %>%
  # as.character() lets this also work when name and gender are factors
  mutate(name_max = max(as.character(name), na.rm = TRUE),
         gender_max = max(as.character(gender), na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(name   = if_else(is.na(name), name_max, as.character(name)),
         gender = if_else(is.na(gender), gender_max, as.character(gender))) %>%
  select(ID, name, gender) %>%
  distinct()
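
For comparison, the same idea (fill each NA from a non-missing value in the same ID group, then drop duplicate rows) can probably be sketched in base R without dplyr. This is only an illustration, not the answer's method: `fill_group` is a hypothetical helper, and it fills with the first non-missing value in the group rather than `max()`. The data is the 8-row `df2` from the question.

```r
# 8-row example data from the question
df2 <- data.frame(
  ID     = c(1, 1, 2, 4, 4, 5, 5, 5),
  name   = c("kate", NA, "jim", NA, "dan", "lou", "lou smith", NA),
  gender = c(NA, "female", "male", "male", NA, "female", "female", "female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in x with the first non-missing value of x.
# A group that is entirely NA is returned unchanged.
fill_group <- function(x) {
  non_na <- x[!is.na(x)]
  if (length(non_na) > 0) x[is.na(x)] <- non_na[1]
  x
}

# Apply the helper within each ID group, one column at a time
filled <- df2
for (col in c("name", "gender")) {
  filled[[col]] <- ave(df2[[col]], df2$ID, FUN = fill_group)
}

# Rows that became identical after filling collapse into one;
# rows with conflicting values (lou vs. lou smith) are both kept
collapsed <- unique(filled)
print(collapsed)
```

Because conflicting rows differ in at least one column after filling, `unique()` leaves them alone, which is the behaviour the question asks for.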

Answer 1: (score: 1)

If the goal is to replace each NA with the adjacent non-NA value and then keep only the distinct rows, we can use fill from the tidyverse.

library(tidyverse)
df2 %>% 
   group_by(ID) %>% 
   fill(name, gender) %>% 
   fill(name, gender, .direction = 'up') %>%
   distinct
# A tibble: 5 x 3
# Groups:   ID [4]
#     ID name      gender
#  <int> <chr>     <chr> 
#1     1 kate      female
#2     2 jim       male  
#3     4 dan       male  
#4     5 lou       female
#5     5 lou smith female
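
As a possible shortening, newer versions of tidyr (1.0.0 and later, as far as I know) accept `.direction = "downup"`, which fills down and then up in a single `fill()` call. The sketch below assumes such a version is installed; the object name `collapsed` is mine, and the data is the same 8-row `df2`.

```r
library(dplyr)
library(tidyr)

# 8-row example data from the question
df2 <- data.frame(
  ID     = c(1, 1, 2, 4, 4, 5, 5, 5),
  name   = c("kate", NA, "jim", NA, "dan", "lou", "lou smith", NA),
  gender = c(NA, "female", "male", "male", NA, "female", "female", "female"),
  stringsAsFactors = FALSE
)

collapsed <- df2 %>%
  group_by(ID) %>%
  # fill NAs downwards first, then upwards, within each ID group
  fill(name, gender, .direction = "downup") %>%
  ungroup() %>%
  # drop rows that became identical after filling
  distinct()

print(collapsed)
```

Note that `fill()` depends on row order: an NA takes the nearest non-missing value above it in its group (or below it, if there is none above), so with conflicting values the result can change if the rows are sorted differently.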

Data

df2 <- structure(list(ID = c(1L, 1L, 2L, 4L, 4L, 5L, 5L, 5L), name = c("kate", 
NA, "jim", NA, "dan", "lou", "lou smith", NA), gender = c(NA, 
"female", "male", "male", NA, "female", "female", "female")),
  class = "data.frame", row.names = c("1", 
 "2", "3", "4", "5", "6", "7", "8"))