How to merge data across data.frame rows based on identical column values, for a data.frame with multiple column types

Asked: 2019-05-30 16:48:08

Tags: r

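I have raw data in the df format below that I need to condense into a data.frame with fewer rows. The data represent eventual GIS point data (the coordinate columns are omitted here), so I want to avoid plotting any duplicate points. Each row represents one point, with data published by agency 1, agency 2, or both. A simpler case with fewer columns is answered here.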

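Where rows share the same agency3_id, I want to condense them into a single row. For example, I want rows 3 and 4 of the original data.frame (both with agency3_id abcde) to become one row (row 3) in the desired data.frame below. Rows whose agency3_id is NA should be left as they are. I am open to any R approach. I am sure there is a better title for this question, too. Thanks for any help.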

library(tidyverse)
library(lubridate)

agency1_id   <- as.double(c("1500", NA, "2007", NA, "4501", NA))
agency2_id   <- c(NA, "zxc", NA, "xcv", NA, "bnm")
agency3_id   <- c("qwert", "ertyu", "abcde", "abcde", NA, NA)
agency1_val  <- c(0.21, 1.5, 0.23, NA, 4.3, NA)
agency2_val  <- c(0.05, 4.4, NA, 6.3, NA, 2.0)
agency1_desc <- c("st", NA, "ko", NA, "ui", NA)
agency2_desc <- c(NA, "lo", NA, "vf", NA, "kl")
agency1_dtm  <- ymd_hm(c("2019-05-30 04:30", NA, "2019-05-30 04:35", 
                          NA, "2019-05-30 04:33", NA))
agency2_dtm  <- ymd_hm(c(NA, "2019-05-30 04:20", NA, "2019-05-30 04:29",
                          NA, "2019-05-30 04:31"))

df <- data.frame(agency1_id, agency2_id, agency3_id, agency1_val,
                 agency2_val, agency1_desc, agency2_desc, agency1_dtm,
                 agency2_dtm)
as_tibble(df)

# agency1_id agency2_id agency3_id agency1_val agency2_val agency1_desc agency2_desc agency1_dtm         agency2_dtm        
# <dbl>      <fct>      <fct>       <dbl>       <dbl>        <fct>        <fct>        <dttm>              <dttm>             
# 1 1500       NA         qwert      0.21        0.05        st           NA           2019-05-30 04:30:00 NA                 
# 2 NA         zxc        ertyu      1.5         4.4         NA           lo           NA                  2019-05-30 04:20:00
# 3 2007       NA         abcde      0.23         NA         ko           NA           2019-05-30 04:35:00 NA                 
# 4 NA         xcv        abcde       NA         6.3         NA           vf           NA                  2019-05-30 04:29:00
# 5 4501       NA         NA         4.3          NA         ui           NA           2019-05-30 04:33:00 NA                 
# 6 NA         bnm        NA          NA         2           NA           kl           NA                  2019-05-30 04:31:00

Desired df

# agency1_id agency2_id agency3_id agency1_val agency2_val agency1_desc agency2_desc agency1_dtm         agency2_dtm        
# <dbl>      <fct>      <fct>       <dbl>       <dbl>        <fct>        <fct>        <dttm>              <dttm>             
# 1 1500       NA         qwert      0.21        0.05        st           NA           2019-05-30 04:30:00 NA                 
# 2 NA         zxc        ertyu      1.5         4.4         NA           lo           NA                  2019-05-30 04:20:00
# 3 2007       xcv        abcde      0.23        6.3         ko           vf           2019-05-30 04:35:00 2019-05-30 04:29:00                 
# 4 4501       NA         NA         4.3          NA         ui           NA           2019-05-30 04:33:00 NA                 
# 5 NA         bnm        NA          NA         2           NA           kl           NA                  2019-05-30 04:31:00

1 answer:

Answer 0 (score: 1)

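You can use the following. It may not be the most concise solution. The rows with an NA agency3_id are set aside first, the remaining rows are collapsed per agency3_id by taking the first non-NA value in each column, and the two parts are then bound back together.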

# Data with NA values in column - agency3_id
df_na <- df[is.na(df$agency3_id), ]

# Logic: for each agency3_id, keep the first non-NA value of every column
df <- df[!is.na(df$agency3_id), ] %>% 
  group_by(agency3_id) %>% 
  summarise_all(list(~ if(all(is.na(.))) NA else .[!is.na(.)][1]))

# Merge dataframes
rbind(df, df_na)
# Result
# A tibble: 5 x 9
  agency3_id agency1_id agency2_id agency1_val agency2_val agency1_desc agency2_desc agency1_dtm        
* <fct>           <dbl> <fct>            <dbl>       <dbl> <fct>        <fct>        <dttm>             
1 abcde            2007 xcv               0.23        6.3  ko           vf           2019-05-30 04:35:00
2 ertyu              NA zxc               1.5         4.4  NA           lo           NA                 
3 qwert            1500 NA                0.21        0.05 st           NA           2019-05-30 04:30:00
4 NA               4501 NA                4.3        NA    ui           NA           2019-05-30 04:33:00
5 NA                 NA bnm              NA           2    NA           kl           NA                 
# … with 1 more variable: agency2_dtm <dttm>
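
Note that the result above is sorted by agency3_id, with the NA rows appended at the end, so the row order differs from the desired df. If the original order matters, a base R variant along the following lines might help. This is only a minimal sketch under some assumptions: it uses a reduced copy of the question's data (fewer columns, strings kept as character rather than factor), and `first_non_na`, `key`, `groups` and `collapsed` are names made up for illustration. Rows with an NA agency3_id get a unique placeholder key so that they are never merged with each other.

```r
# Reduced copy of the question's data (assumption: a subset of columns is
# enough to illustrate; strings are kept as character, not factor)
df <- data.frame(
  agency1_id  = c(1500, NA, 2007, NA, 4501, NA),
  agency2_id  = c(NA, "zxc", NA, "xcv", NA, "bnm"),
  agency3_id  = c("qwert", "ertyu", "abcde", "abcde", NA, NA),
  agency1_val = c(0.21, 1.5, 0.23, NA, 4.3, NA),
  agency2_val = c(0.05, 4.4, NA, 6.3, NA, 2.0),
  agency1_dtm = as.POSIXct(c("2019-05-30 04:30", NA, "2019-05-30 04:35",
                             NA, "2019-05-30 04:33", NA),
                           format = "%Y-%m-%d %H:%M", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value of a vector.
# Indexing an empty vector with [1] yields an NA of the same type.
first_non_na <- function(x) x[!is.na(x)][1]

# Grouping key: the agency3_id itself, or a unique placeholder for NA rows
key <- ifelse(is.na(df$agency3_id),
              paste0("__row", seq_len(nrow(df))),
              df$agency3_id)

# Split row indices by key, keeping groups in order of first appearance
groups <- split(seq_len(nrow(df)), factor(key, levels = unique(key)))

# Collapse each group to one row, column by column, then stack the rows
collapsed <- do.call(rbind, lapply(groups, function(idx) {
  as.data.frame(lapply(df[idx, , drop = FALSE], first_non_na),
                stringsAsFactors = FALSE)
}))
rownames(collapsed) <- NULL
collapsed
```

As in the answer's code, if two rows in a group both carry a non-NA value in the same column, only the first one is kept, so this is suitable only when the rows being merged do not conflict.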