R - Replace values in one df, by group and column, based on a condition from another df (with repeated IDs)

Asked: 2019-12-19 16:11:16

Tags: r dataframe dplyr rep

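I'm stuck. I have 2 data frames: df1 has unique station IDs, a % value for each month, and the number of times (years) each station occurs in the data; df2 has the station IDs repeated once per year, with a value for each month of each year.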

df1: the percentage of non-missing temperature data per station and month; n is the number of years in that station's record

station_ID  Jan  Feb  Mar ... Dec  n
10160355    37   39   38      39   141
10160360    94   91   98      89   56
10160390    83   87   85      82   163

df2: temperature data per station, by month and year; n in df1 is the number of times the station_ID is repeated in df2

station_ID  year  Jan  Feb  Mar ... Dec
10160355    1878  NA   10   12      12
10160355    1879  12   12   13      10
...
10160355    2018  14   11   15      14
10160360    1963  12   10   12      14
10160360    1964  10   12   15      11
...
(repeats for all stations & total rows = 277604)

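What I need: for each month column, if the df1 value for a station is below 50%, replace the df2 data with NA in all rows for that station/month; otherwise leave df2 unchanged. So, because df1$station_ID[1] has only 37% for Jan, every Jan value for that station (df2 rows 1:141) becomes NA.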

Example of the output I need:

station_ID   year  Jan  Feb ...  Dec
10160355     1878  NA   NA       NA
10160355     1879  NA   NA       NA
...
10160360     1963  12   10       14
10160360     1964  10   12       11
...

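I've tried about 20 different approaches, but I think I need some combination of dplyr and rep to repeat NA across each station's rows when the condition is true.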

My latest attempt handles only one month at a time, since I don't know how to do all columns at once:

df3 =  df2 %>%
    group_by(station_ID) %>%
    select(Jan) %>%
    mutate(if_else(df1$Jan < 50, rep(NA_character_, df1$n), Jan))

This gives an error about an invalid 'times' argument for rep. I think I might be close, but I'd appreciate any suggestions! Thanks!

1 answer:

Answer 0 (score: 0)

This is much easier to do in "long" format, especially with dplyr:

library(dplyr)
library(tidyr)

df1_long = pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long = pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")

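# Join the coverage percentages onto the temperatures by station and month,
# then blank out temperatures where coverage is below 50%
result_long = df2_long %>%
  left_join(df1_long, by = c("station_ID", "month")) %>%
  mutate(temp = ifelse(non_missing < 50, NA, temp))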

result_long
# # A tibble: 20 x 6
#    station_ID  year month  temp     n non_missing
#         <int> <int> <chr> <int> <int>       <int>
#  1   10160355  1878 Jan      NA   141          37
#  2   10160355  1878 Feb      NA   141          39
#  3   10160355  1878 Mar      NA   141          38
#  4   10160355  1878 Dec      NA   141          39
#  5   10160355  1879 Jan      NA   141          37
#  6   10160355  1879 Feb      NA   141          39
#  7   10160355  1879 Mar      NA   141          38
#  8   10160355  1879 Dec      NA   141          39
#  9   10160355  2018 Jan      NA   141          37
# 10   10160355  2018 Feb      NA   141          39
# 11   10160355  2018 Mar      NA   141          38
# 12   10160355  2018 Dec      NA   141          39
# 13   10160360  1963 Jan      12    56          94
# 14   10160360  1963 Feb      10    56          91
# 15   10160360  1963 Mar      12    56          98
# 16   10160360  1963 Dec      14    56          89
# 17   10160360  1964 Jan      10    56          94
# 18   10160360  1964 Feb      12    56          91
# 19   10160360  1964 Mar      15    56          98
# 20   10160360  1964 Dec      11    56          89
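
One caveat about the join-then-ifelse step, as a minimal sketch on hypothetical toy data (the names temps, coverage, as_is, and guarded are made up for illustration): if a station in the temperature table has no matching row in the coverage table, left_join fills non_missing with NA, and ifelse then returns NA for that row as well. Whether that is desirable depends on your data; a guarded condition keeps such rows unchanged.

```r
library(dplyr)

# Hypothetical long-format toy data: station 2 has no coverage row at all
temps <- data.frame(
  station_ID = c(1, 1, 2),
  month      = c("Jan", "Feb", "Jan"),
  temp       = c(10L, 11L, 12L)
)
coverage <- data.frame(
  station_ID  = c(1, 1),
  month       = c("Jan", "Feb"),
  non_missing = c(37L, 94L)
)

joined <- left_join(temps, coverage, by = c("station_ID", "month"))

# Same condition as above: an NA test yields NA, so station 2 is blanked too
as_is <- mutate(joined, temp = ifelse(non_missing < 50, NA, temp))

# Guarded condition: only blank rows with a known coverage below 50
guarded <- mutate(
  joined,
  temp = ifelse(!is.na(non_missing) & non_missing < 50, NA, temp)
)
```

If every station in df2 is guaranteed to appear in df1, as the question implies, the two versions give the same result.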

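In many cases (especially for plotting, but also for modeling), I'd recommend keeping the data in this long format. However, you can convert it back to the original wide format: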

result_wide = result_long %>%
  select(-n, -non_missing) %>%
  pivot_wider(names_from = "month", values_from = "temp")
result_wide
# # A tibble: 5 x 6
#   station_ID  year   Jan   Feb   Mar   Dec
#        <int> <int> <int> <int> <int> <int>
# 1   10160355  1878    NA    NA    NA    NA
# 2   10160355  1879    NA    NA    NA    NA
# 3   10160355  2018    NA    NA    NA    NA
# 4   10160360  1963    12    10    12    14
# 5   10160360  1964    10    12    15    11
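
If you would rather not reshape at all, a base R alternative is to look up each df2 row's station in df1 with match() and blank one month column at a time. This is only a sketch under the assumption that the month columns have the same names in both data frames; months, idx, low, and df2_masked are illustrative names, and the sample data is repeated so the block runs on its own.

```r
df1 <- read.table(text = "station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163", header = TRUE)

df2 <- read.table(text = "station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11", header = TRUE)

months <- c("Jan", "Feb", "Mar", "Dec")  # the full data would use all 12 months

# For each df2 row, the position of its station in df1 (NA if absent)
idx <- match(df2$station_ID, df1$station_ID)

df2_masked <- df2
for (m in months) {
  # TRUE where this row's station has under 50% coverage for month m
  low <- df1[[m]][idx] < 50
  # which() drops NAs, so stations missing from df1 are left untouched
  df2_masked[[m]][which(low)] <- NA
}

df2_masked
```

This keeps the wide layout throughout, at the cost of an explicit loop over the month names.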

Using this data:

df1 = read.table(text = 'station_ID  Jan  Feb  Mar  Dec  n
10160355    37   39   38      39   141
10160360    94   91   98      89   56
10160390    83   87   85      82   163', header = TRUE)

df2 = read.table(text = 'station_ID  year  Jan  Feb  Mar  Dec
10160355    1878  NA   10   12      12
10160355    1879  12   12   13      10
10160355    2018  14   11   15      14
10160360    1963  12   10   12      14
10160360    1964  10   12   15      11', header = TRUE)
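
As for the error in the question's attempt, my reading (not verified against the full data) is that it comes from rep itself: df1$n is not subset by group_by, so rep receives the whole vector of counts as times while x has length 1, and rep rejects that combination. A small sketch using the three n values from the sample df1:

```r
n <- c(141, 56, 163)  # df1$n from the sample data

# A length-1 x with a length-3 times is an error; capture its message
msg <- tryCatch(
  rep(NA_character_, n),
  error = function(e) conditionMessage(e)
)
msg

# A single count works and gives one NA per year for that station
one_station <- rep(NA_character_, n[1])
length(one_station)
```

Note also that NA_character_ is a character NA, while the temperatures are numeric, and dplyr's if_else requires both branches to have compatible types.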