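I have a table of weather data. The goal is to come up with some type of confidence interval by running a model on:
a) the weather data
b) the weather + 1 sd
c) the weather - 1 sd
Here I have daily temperatures for two cities, plus an equivalent table of standard deviations broken out by month. What I would like to do is write a function that transforms the data frame by applying the relevant monthly standard deviation to each value. In the example below, I would like to add 9.07 and 9.37 degrees to every November value for Boise and Idaho Falls, respectively, and then add 9.15 and 11.03 degrees to every December value, again for Boise and Idaho Falls respectively.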
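I know I could do this in a "messy" way with some intermediate steps, creating a few columns and then cleaning them up at the end. For the sake of learning, though, I would like to understand how to write a more elegant solution.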
df <- structure(list(Date = c("2014-11-01", "2014-11-02", "2014-11-03",
"2014-11-04", "2014-11-05", "2014-11-06", "2014-11-07", "2014-11-08",
"2014-11-09", "2014-11-10", "2014-11-11", "2014-11-12", "2014-11-13",
"2014-11-14", "2014-11-15", "2014-11-16", "2014-11-17", "2014-11-18",
"2014-11-19", "2014-11-20", "2014-11-21", "2014-11-22", "2014-11-23",
"2014-11-24", "2014-11-25", "2014-11-26", "2014-11-27", "2014-11-28",
"2014-11-29", "2014-11-30", "2014-12-01", "2014-12-02", "2014-12-03",
"2014-12-04", "2014-12-05", "2014-12-06", "2014-12-07", "2014-12-08",
"2014-12-09", "2014-12-10", "2014-12-11", "2014-12-12", "2014-12-13",
"2014-12-14", "2014-12-15", "2014-12-16", "2014-12-17", "2014-12-18",
"2014-12-19", "2014-12-20", "2014-12-21", "2014-12-22", "2014-12-23",
"2014-12-24", "2014-12-25", "2014-12-26", "2014-12-27", "2014-12-28",
"2014-12-29", "2014-12-30"), BOISE = c(44.5, 42.5, 43.5, 47.5,
55, 57.5, 49.5, 47.5, 45, 38, 31, 23.5, 24, 21.5, 11.5, 13, 13,
13, 16, 22, 32, 42, 37, 38, 46.5, 48.5, 49.5, 52.5, 42, 26, 31.5,
33, 40, 48.5, 40, 44, 43.5, 42, 42.5, 46, 57, 51, 39.5, 34, 36.5,
39, 36.5, 40.5, 40.5, 40, 43.5, 39.5, 35.5, 33, 32, 29, 27, 31,
27, 20.5699996948242), `IDAHO FALLS` = c(54.5, 36, 34.5, 35.5,
41, 41.5, 47, 39, 45.5, 36, 15, 13, 14, 26, 4.5, 2.5, 8, 11,
28, 27, 27, 35.5, 31.5, 33, 39, 43, 45.5, 46, 42.5, 28.5, 27,
34, 35.5, 42, 36.5, 42.5, 35, 36, 34.5, 36.5, 42.5, 47, 39, 28,
23.5, 31, 22.5, 24.5, 34.5, 35, 38.5, 34, 27.5, 31.5, 24.5, 8.5,
15, 19, 10.5, -3.46000003814697)), class = "data.frame", .Names = c("Date",
"BOISE", "IDAHO FALLS"), row.names = c(NA, -60L))
sd_matrix <- structure(list(month = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
), BOISE = c(7.90623167260698, 6.46123050256436, 6.38106936624632,
7.22283114115187, 7.76515042234502, 8.10445388054925, 5.65058663778116,
6.18033208264487, 7.34160028246709, 7.48784870009556, 9.07481352622016,
9.15757443706943), `IDAHO FALLS` = c(10.4267588417941, 9.89036971863809,
7.99156512696757, 6.84627542213131, 6.6696338642145, 6.823026513784,
4.31982292105468, 4.63179196395735, 6.38702016727256, 7.31441201561822,
9.37466284053354, 11.0316440728702)), class = "data.frame", row.names = c(NA,
-12L), .Names = c("month", "BOISE", "IDAHO FALLS"))
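Here is some hacky code that gives the right result in this particular instance, but it hard-codes the column names and the number of columns, which I will have to handle in general: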
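library(dplyr)      # inner_join(), select(), %>%
library(lubridate)  # month()

df$month <- month(df$Date)
# Both tables have BOISE and `IDAHO FALLS` columns, so the join
# suffixes them with .x (temperatures) and .y (standard deviations)
df <- inner_join(df, sd_matrix, by="month")
df$BOISE.x <- df$BOISE.x + df$BOISE.y
df$`IDAHO FALLS.x` <- df$`IDAHO FALLS.x` + df$`IDAHO FALLS.y`
df <- df %>%
select(Date, BOISE.x, `IDAHO FALLS.x`)
names(df) <- c("Date", "Boise", "Idaho Falls")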
Answer 0 (score: 3)
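You should really read the Tidy Data paper. It provides a very useful framework for thinking about problems like this. That framework would say your data is not tidy because you are encoding information in column names: "location" is an important piece of data, but instead of putting location in a single column you have spread it across multiple column names, which makes everything harder than it needs to be.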
We use tidyr::gather to convert your data to long format, with just one location column and one temperature column:
library(tidyr)
l_df = gather(df, key = loc, value = temp, -Date)
l_sd = gather(sd_matrix, key = loc, value = sd, -month)
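In tidyr 1.0.0 and later, gather is superseded by pivot_longer. As a minimal sketch, the same reshape could be written as follows. The toy data frame `wide` is an assumption for illustration only, holding two rows copied from the question's data. The result has the same rows as gather would produce, though the row order differs.

```r
library(tidyr)

# Toy stand-in for the question's wide table (two rows copied from its data).
wide <- data.frame(
  Date = c("2014-11-01", "2014-12-01"),
  BOISE = c(44.5, 31.5),
  `IDAHO FALLS` = c(54.5, 27),
  check.names = FALSE  # keep the space in the `IDAHO FALLS` column name
)

# One row per (Date, loc) pair: the column names move into `loc`,
# and the cell values move into `temp`.
long <- pivot_longer(wide, cols = -Date, names_to = "loc", values_to = "temp")
```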
Once that is done, we can do a simple join on location and month, then add and subtract the standard deviation as needed:
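library(dplyr)  # mutate(), inner_join(), %>%
result = mutate(l_df, month = lubridate::month(Date)) %>%
inner_join(l_sd, by = c("loc", "month")) %>%  # match each reading to its location's monthly sd
mutate(temp_u1 = temp + sd,   # temperature + 1 sd
temp_l1 = temp - sd)          # temperature - 1 sd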
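Since the question asks for a function, the reshape, join, and shift steps could be wrapped up roughly as below. This is only a sketch: the name `shift_by_monthly_sd`, its `k` argument, and the toy inputs (with standard deviations rounded to the values quoted in the question) are assumptions for illustration, and it uses pivot_longer and pivot_wider, the newer equivalents of gather and spread. It assumes the first table has a `Date` column and the second a `month` column, with matching location columns in both.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: shift every location column of `wide` by `k`
# monthly standard deviations taken from `sds`, and return a wide table.
shift_by_monthly_sd <- function(wide, sds, k = 1) {
  long_temp <- pivot_longer(wide, cols = -Date, names_to = "loc", values_to = "temp")
  long_sd   <- pivot_longer(sds, cols = -month, names_to = "loc", values_to = "sd")

  long_temp %>%
    mutate(month = lubridate::month(as.Date(Date))) %>%
    inner_join(long_sd, by = c("loc", "month")) %>%
    mutate(temp = temp + k * sd) %>%
    select(Date, loc, temp) %>%
    pivot_wider(names_from = loc, values_from = temp)
}

# Toy inputs: two dates from the question and the rounded sds it quotes.
wide <- data.frame(Date = c("2014-11-01", "2014-12-01"),
                   BOISE = c(44.5, 31.5),
                   `IDAHO FALLS` = c(54.5, 27),
                   check.names = FALSE)
sds <- data.frame(month = c(11, 12),
                  BOISE = c(9.07, 9.15),
                  `IDAHO FALLS` = c(9.37, 11.03),
                  check.names = FALSE)

upper <- shift_by_monthly_sd(wide, sds, k = 1)   # weather + 1 sd
lower <- shift_by_monthly_sd(wide, sds, k = -1)  # weather - 1 sd
```

Because the location names are carried as data rather than typed into the code, the same function should work for any number of location columns.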
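At this point you could use tidyr::spread to go back to wide format, but I would recommend keeping your data in this format. It may be even more convenient to go to a still longer format: instead of encoding the +/- SD information in column names, have an SD multiplier column with values -1, 0, 1 and a single temperature column. The format I have above works well for, e.g., plotting confidence bands. If you are interested in, say, +/- 2, 1.5, 1, and 0.5 standard deviations, and you run code on each estimate, then the longer format will generalize better.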
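As a rough sketch of that longer format, every observation can be crossed with every multiplier and shifted in a single column. The toy data frame `joined` and the column names `sd_mult` and `temp_adj` are assumptions for illustration; `joined` stands in for the long table after the monthly standard deviations have been joined on.

```r
library(dplyr)
library(tidyr)

# Toy long-format input: one row per date/location, with its monthly sd attached.
joined <- data.frame(
  Date = c("2014-11-01", "2014-11-01"),
  loc  = c("BOISE", "IDAHO FALLS"),
  temp = c(44.5, 54.5),
  sd   = c(9.07, 9.37)
)

# Pair every observation with every multiplier, then compute the shifted
# temperature once instead of once per +/- column.
scenarios <- expand_grid(joined, sd_mult = c(-1, 0, 1)) %>%
  mutate(temp_adj = temp + sd_mult * sd)
```

Adding more scenarios then only means extending the multiplier vector, e.g. `c(-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2)`.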