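I'm stuck. I have 2 data frames: df1 has unique station IDs, % values by month, and the number of times (years) each station appears in the data; df2 has station IDs repeated by year, with monthly values for each year.
df1: the percentage of non-missing temperature data for each station and month; n is the number of years in that station's record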
station_ID Jan Feb Mar ... Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163
df2: temperature data for each station by month and year; n in df1 is the number of rows in df2 with that station_ID
station_ID year Jan Feb Mar ... Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
...
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11
...
(repeats for all stations & total rows = 277604)
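What I need: for each month column, if the df1 value for a station is < 50%, replace the df2 data with NA in all rows for that station/month; otherwise leave df2 unchanged. So, since df1$station_ID[1] shows only 37% for Jan, all Januaries for that station (df2$station[1:141]) become NA.
Example of the output I need: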
station_ID year Jan Feb ... Dec
10160355 1878 NA NA NA
10160355 1879 NA NA NA
...
10160360 1963 12 10 14
10160360 1964 10 12 11
...
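I have tried about 20 different approaches, but I think I need some combination of dplyr and rep to repeat NA across each station's rows when the condition is true.
My latest attempt handles only one month at a time, since I don't know how to do all the columns: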
df3 = df2 %>%
group_by(station_ID) %>%
select(Jan) %>%
mutate(if_else(df1$Jan < 50, rep(NA_character_, df1$n), Jan))
This gives an error about an invalid 'times' argument for rep. I think I might be close, but I'd appreciate any suggestions! Thanks!
Answer 0 (score: 0)
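For context, the error in the question most likely comes from the call to rep: with a length-one x, the times argument has to be a single number, but df1$n holds one value per station. A minimal sketch reproducing this in isolation (the vector n_years is a hypothetical stand-in for df1$n, using the three values shown in the question):

```r
# Hypothetical stand-in for df1$n: one record length per station
n_years <- c(141, 56, 163)

# rep() with a length-one x accepts only a single 'times' value,
# so passing the whole vector fails; capture the message instead of stopping
msg <- tryCatch(
  rep(NA_character_, n_years),
  error = function(e) conditionMessage(e)
)
msg

# A single count works as expected
length(rep(NA_character_, n_years[1]))
```

Separately, NA_character_ would not match the numeric month columns, so even with a valid times value the if_else call would likely complain about mismatched types.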
This is much easier to do in "long" format, especially with dplyr:
library(dplyr)
library(tidyr)
df1_long = pivot_longer(df1, cols = Jan:Dec, names_to = "month", values_to = "non_missing")
df2_long = pivot_longer(df2, cols = Jan:Dec, names_to = "month", values_to = "temp")
result_long = df2_long %>%
left_join(df1_long, by = c("station_ID", "month")) %>%
mutate(temp = ifelse(non_missing < 50, NA, temp))
result_long
# # A tibble: 20 x 6
# station_ID year month temp n non_missing
# <int> <int> <chr> <int> <int> <int>
# 1 10160355 1878 Jan NA 141 37
# 2 10160355 1878 Feb NA 141 39
# 3 10160355 1878 Mar NA 141 38
# 4 10160355 1878 Dec NA 141 39
# 5 10160355 1879 Jan NA 141 37
# 6 10160355 1879 Feb NA 141 39
# 7 10160355 1879 Mar NA 141 38
# 8 10160355 1879 Dec NA 141 39
# 9 10160355 2018 Jan NA 141 37
# 10 10160355 2018 Feb NA 141 39
# 11 10160355 2018 Mar NA 141 38
# 12 10160355 2018 Dec NA 141 39
# 13 10160360 1963 Jan 12 56 94
# 14 10160360 1963 Feb 10 56 91
# 15 10160360 1963 Mar 12 56 98
# 16 10160360 1963 Dec 14 56 89
# 17 10160360 1964 Jan 10 56 94
# 18 10160360 1964 Feb 12 56 91
# 19 10160360 1964 Mar 15 56 98
# 20 10160360 1964 Dec 11 56 89
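In many cases (especially plotting, but also modeling), I'd recommend sticking with this long-format data. However, you can convert it back to the original wide format: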
result_wide = result_long %>%
select(-n, -non_missing) %>%
pivot_wider(names_from = "month", values_from = "temp")
result_wide
# # A tibble: 5 x 6
# station_ID year Jan Feb Mar Dec
# <int> <int> <int> <int> <int> <int>
# 1 10160355 1878 NA NA NA NA
# 2 10160355 1879 NA NA NA NA
# 3 10160355 2018 NA NA NA NA
# 4 10160360 1963 12 10 12 14
# 5 10160360 1964 10 12 15 11
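If reshaping is not wanted, the same masking can probably be done directly on the wide data in base R. The following is only a sketch under a few assumptions: the month columns are named identically in df1 and df2, the names months, idx, low_coverage and df2_masked are made up for illustration, and stations that are absent from df1 are left unchanged. The sample data from this answer is repeated so the block runs on its own.

```r
# Sample data copied from this answer (only Jan, Feb, Mar, Dec are present)
df1 <- read.table(text = 'station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163', header = TRUE)
df2 <- read.table(text = 'station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11', header = TRUE)

# Month columns shared by both data frames (extend to all 12 for real data)
months <- c("Jan", "Feb", "Mar", "Dec")

# For every row of df2, the row of df1 that describes the same station
idx <- match(df2$station_ID, df1$station_ID)

# Logical matrix with the same shape as df2[months]:
# TRUE where that station/month has less than 50% non-missing data
low_coverage <- as.matrix(df1[idx, months]) < 50

# Assumption: stations not found in df1 produce NA here and are kept as-is
low_coverage[is.na(low_coverage)] <- FALSE

# Blank out the flagged cells; station_ID and year are untouched
df2_masked <- df2
df2_masked[months][low_coverage] <- NA
df2_masked
```

This avoids the join entirely, at the cost of relying on column names lining up between the two data frames; the long-format version above is easier to extend and to check.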
Using this data:
df1 = read.table(text = 'station_ID Jan Feb Mar Dec n
10160355 37 39 38 39 141
10160360 94 91 98 89 56
10160390 83 87 85 82 163', header = TRUE)
df2 = read.table(text = 'station_ID year Jan Feb Mar Dec
10160355 1878 NA 10 12 12
10160355 1879 12 12 13 10
10160355 2018 14 11 15 14
10160360 1963 12 10 12 14
10160360 1964 10 12 15 11', header = TRUE)