Here are two data frames.
id <- c(1,2,3,4)
id2 <- c(5,6,7,8)
list <- c("list1","list2","list3","list4")
progress <- c("A", "A", "B", "C")
grade <- c("A", NA, "B", "C")
df1 <- data.frame(id, id2, list, progress, grade)
df1
id <- c(1,2,3,5)
id2 <- c(5,6,7,9)
list <- c("list1","list2","list5","list6")
progress <- c("B", "B", "A", "D")
grade2 <- c("B", NA, "B", "D")
df2 <- data.frame(id, id2, list, progress, grade2)
df2
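I would like to combine df1 and df2 such that:
a) For the column list, if a pair of id and id2 values is duplicated, the corresponding list values should also match; otherwise NA should be returned. This condition does not apply to unique pairs of id and id2.
b) For the column progress, if a pair of id and id2 values is duplicated, the first occurring value must be taken.
c) For the columns grade and grade2, if a pair of id and id2 values is duplicated, the NA values must be dropped in that case.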
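To make rule a) concrete, here is a minimal base R sketch that finds the (id, id2) pairs present in both data frames and sets list to NA where the two sources disagree. It rebuilds only the columns it needs from the example data; the name shared and the .df1/.df2 suffixes are illustrative choices, not part of the question.

```r
# Only the key columns and `list` from the example data are needed here
df1 <- data.frame(
  id = c(1, 2, 3, 4), id2 = c(5, 6, 7, 8),
  list = c("list1", "list2", "list3", "list4"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(1, 2, 3, 5), id2 = c(5, 6, 7, 9),
  list = c("list1", "list2", "list5", "list6"),
  stringsAsFactors = FALSE
)

# An inner merge on the key pair keeps only (id, id2) pairs found in both frames
shared <- merge(df1, df2, by = c("id", "id2"), suffixes = c(".df1", ".df2"))

# Rule a): keep the list value where both sources agree, otherwise NA
shared$list <- ifelse(shared$list.df1 == shared$list.df2,
                      shared$list.df1, NA_character_)
shared
```

Pairs that occur in only one data frame, such as (4, 8) and (5, 9), never reach this comparison, which is why the rule does not apply to them.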
The expected output is as follows:
#id id2 list progress grade grade2
#1 5 list1 A A B
#2 6 list2 A NA NA
#3 7 NA B B B
#4 8 list4 C C NA
#5 9 list6 D NA D
Answer 0 (score: 2)
Because of your initial data structure, this answer is fairly complicated, but here is my solution using tools from dplyr:
library(dplyr)
# Bind the rows of the two data frames together
bind_rows(df1, df2) %>%
  # a) For each pair of id and id2...
  group_by(id, id2) %>%
  # ...when there is more than one distinct list value, set it to NA;
  # otherwise keep the single value (list[1] is always length 1, so it
  # recycles safely whatever the group size)
  mutate(list = case_when(length(unique(list)) > 1 ~ NA_character_,
                          TRUE ~ list[1])) %>%
  # b) Take the first occurring progress value (still for each id, id2 pair)
  mutate(progress = progress[1]) %>%
  ungroup() %>%
  # Keep distinct pairs
  distinct(id, id2, list, progress) %>%
  # c)
  # Create a smaller data set of the non-NA grade for the id, id2 pairs
  # Join it onto the larger data set
  left_join(
    bind_rows(df1, df2) %>%
      select(id, id2, grade) %>%
      na.omit(),
    by = c("id", "id2")
  ) %>%
  # c continued)
  # Create a smaller data set of the non-NA grade2 for the id, id2 pairs
  # Join it onto the larger data set
  left_join(
    bind_rows(df1, df2) %>%
      select(id, id2, grade2) %>%
      na.omit(),
    by = c("id", "id2")
  )
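The same three rules can also be expressed as one grouped summary instead of two extra joins. The following is only a sketch under a few assumptions: dplyr 1.0.0 or later is installed (for the .groups argument), each (id, id2) pair has at most one non-NA grade and grade2, and first_non_na is a hypothetical helper written for this example, not a dplyr function.

```r
library(dplyr)

df1 <- data.frame(
  id = c(1, 2, 3, 4), id2 = c(5, 6, 7, 8),
  list = c("list1", "list2", "list3", "list4"),
  progress = c("A", "A", "B", "C"),
  grade = c("A", NA, "B", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(1, 2, 3, 5), id2 = c(5, 6, 7, 9),
  list = c("list1", "list2", "list5", "list6"),
  progress = c("B", "B", "A", "D"),
  grade2 = c("B", NA, "B", "D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value, or NA when there is none
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA_character_
}

result <- bind_rows(df1, df2) %>%
  group_by(id, id2) %>%
  summarise(
    # a) NA when the stacked rows disagree on list, otherwise the shared value
    list = if (n_distinct(list) > 1) NA_character_ else list[1],
    # b) first occurrence wins (df1 rows come first after bind_rows)
    progress = progress[1],
    # c) drop NAs within each pair
    grade = first_non_na(grade),
    grade2 = first_non_na(grade2),
    .groups = "drop"
  )
result
```

Because summarise() returns one row per group, no distinct() step is needed afterwards.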
Answer 1 (score: 2)
The "first" requirement bothers me, but this seems to match your desired output:
library(tidyverse)
bind_rows(
  left_join(df1, df2, by = c('id', 'id2', 'list', 'progress')),
  anti_join(df2, df1, by = c('id', 'id2', 'list', 'progress'))
) %>%
  group_by(id, id2) %>%
  mutate(
    list = ifelse(n_distinct(list) > 1, NA, list),
    progress = first(progress),
    grade = first(grade),
    grade2 = first(na.omit(grade2))
  ) %>%
  ungroup() %>%
  distinct()
Output:
# # A tibble: 5 x 6
# id id2 list progress grade grade2
# <dbl> <dbl> <chr> <chr> <chr> <chr>
# 1 1 5 list1 A A B
# 2 2 6 list2 A NA NA
# 3 3 7 NA B B B
# 4 4 8 list4 C C NA
# 5 5 9 list6 D NA D
Data:
df1 <- data.frame(
  id = 1:4,
  id2 = 5:8,
  list = paste0('list', 1:4),
  progress = c('A', 'A', 'B', 'C'),
  grade = c('A', NA, 'B', 'C'),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(1, 2, 3, 5),
  id2 = c(5, 6, 7, 9),
  list = paste0('list', c(1, 2, 5, 6)),
  progress = c('B', 'B', 'A', 'D'),
  grade2 = c('B', NA, 'B', 'D'),
  stringsAsFactors = FALSE
)
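For readers who prefer to avoid the tidyverse, the same result can be approximated in base R. This is a sketch rather than a drop-in replacement: it assumes character (not factor) columns, and first_non_na is a hypothetical helper defined here for illustration.

```r
df1 <- data.frame(
  id = c(1, 2, 3, 4), id2 = c(5, 6, 7, 8),
  list = paste0("list", 1:4),
  progress = c("A", "A", "B", "C"),
  grade = c("A", NA, "B", "C"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c(1, 2, 3, 5), id2 = c(5, 6, 7, 9),
  list = paste0("list", c(1, 2, 5, 6)),
  progress = c("B", "B", "A", "D"),
  grade2 = c("B", NA, "B", "D"),
  stringsAsFactors = FALSE
)

# Give each frame the grade column it lacks, then stack df1 on top of df2
df1$grade2 <- NA_character_
df2$grade <- NA_character_
cols <- c("id", "id2", "list", "progress", "grade", "grade2")
stacked <- rbind(df1[cols], df2[cols])

# Hypothetical helper: first non-missing value, or NA when there is none
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[1] else NA_character_
}

# Split by the (id, id2) key and collapse every group to one row
groups <- split(stacked, list(stacked$id, stacked$id2), drop = TRUE)
rows <- lapply(groups, function(g) {
  data.frame(
    id = g$id[1],
    id2 = g$id2[1],
    list = if (length(unique(g$list)) > 1) NA_character_ else g$list[1],
    progress = g$progress[1],
    grade = first_non_na(g$grade),
    grade2 = first_non_na(g$grade2),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

The stacking order matters for rule b): because the df1 rows come first, progress[1] is the df1 value whenever a pair appears in both data frames.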
Answer 2 (score: 1)
Here is another option, using a different set of packages.