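I want to find out how far each location's daily temperature deviates from a specified monthly average. I was thinking of a percent-difference value. For example, the specified monthly average is 20 and some of my days are 15 (-25%), 25 (+25%), and 10 (-50%).
The only way I can think of is to create a duplicate column of the monthly average for each location and then compute the difference between the columns with the diff function or a percent-difference formula. Is there a more elegant, less laborious way that also suits big data?
I would then like to take this daily trend or difference and apply it to another set of monthly means to disaggregate them into daily data. For example, say the monthly average is 10 and my days' trends are +25% (12.5), -50% (5), and -25% (7.5). Again, is there an elegant or easier way to do this?
Any help would be greatly appreciated. I'm still very new to R!
Here is some sample data: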
date <- c("2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04","2009-01-05",
"2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04","2009-01-05",
"2009-01-01", "2009-01-02", "2009-01-03", "2009-01-04","2009-01-05")
location <- c("A", "A", "A", "A", "A",
"B", "B", "B", "B", "B",
"C", "C", "C", "C", "C")
daily_temp <- c(10, 12, 12, 9, 8,
13, 14, 18, 8, 11,
14, 18, 20, 16, 17)
data_daily <- cbind(date, location, daily_temp)
mean_monthly <- c(12, 14, 16)
location_monthly <- c("A", "B", "C")
data_monthly <- cbind(mean_monthly, location_monthly)
Answer 0 (score: 1)
First, get the source data into a proper format for analysis:
# cbind() produced character matrices; convert them to data frames
df.daily <- as.data.frame( data_daily, stringsAsFactors = FALSE )
df.monthly <- as.data.frame( data_monthly, stringsAsFactors = FALSE )
library( tidyverse )
df.daily %>%
left_join( df.monthly, by = c( "location" = "location_monthly" ) ) %>%
mutate( daily_temp = as.numeric( daily_temp ) ) %>%
mutate( mean_monthly = as.numeric( mean_monthly ) ) %>%
mutate( delta_temp = ( daily_temp - mean_monthly ) / mean_monthly )
#          date location daily_temp mean_monthly  delta_temp
# 1  2009-01-01        A         10           12 -0.16666667
# 2  2009-01-02        A         12           12  0.00000000
# 3  2009-01-03        A         12           12  0.00000000
# 4  2009-01-04        A          9           12 -0.25000000
# 5  2009-01-05        A          8           12 -0.33333333
# 6  2009-01-01        B         13           14 -0.07142857
# 7  2009-01-02        B         14           14  0.00000000
# 8  2009-01-03        B         18           14  0.28571429
# 9  2009-01-04        B          8           14 -0.42857143
# 10 2009-01-05        B         11           14 -0.21428571
# 11 2009-01-01        C         14           16 -0.12500000
# 12 2009-01-02        C         18           16  0.12500000
# 13 2009-01-03        C         20           16  0.25000000
# 14 2009-01-04        C         16           16  0.00000000
# 15 2009-01-05        C         17           16  0.06250000
# less readable, but usually faster, especially on larger datasets
library( data.table )
setDT( df.monthly )[, mean_monthly := as.numeric( mean_monthly )][setDT( df.daily )[, daily_temp := as.numeric( daily_temp )], on = c( "location_monthly==location" )][, delta_temp := ( daily_temp - mean_monthly ) / mean_monthly ][]
On this small sample, data.table has a slight speed advantage:
microbenchmark::microbenchmark( tidyverse = {df.daily %>%
left_join( df.monthly, by = c( "location" = "location_monthly" ) ) %>%
mutate( daily_temp = as.numeric( daily_temp ) ) %>%
mutate( mean_monthly = as.numeric( mean_monthly ) ) %>%
mutate( delta_temp = ( daily_temp - mean_monthly ) / mean_monthly )},
data.table = {setDT(df.monthly)[, mean_monthly := as.numeric( mean_monthly )][setDT(df.daily)[, daily_temp := as.numeric( daily_temp )], on = c( "location_monthly==location" )][, delta_temp := ( daily_temp - mean_monthly ) / mean_monthly ][]},
times = 100)
# Unit: milliseconds
#        expr      min       lq     mean   median       uq       max neval
#   tidyverse 2.318527 2.408303 2.579056 2.454999 2.513293 13.104373   100
#  data.table 1.515959 1.590221 1.669511 1.643545 1.702141  2.345037   100
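For the second part of the question (applying the daily deviations to another set of monthly means), a minimal base R sketch might look like the following. The `new_monthly` values are hypothetical and invented purely for illustration; the named-vector lookup stands in for the join so the block runs without extra packages.

```r
# Daily temperatures and the specified monthly means from the question
daily <- data.frame(
  date       = rep(as.Date("2009-01-01") + 0:4, times = 3),
  location   = rep(c("A", "B", "C"), each = 5),
  daily_temp = c(10, 12, 12, 9, 8,
                 13, 14, 18, 8, 11,
                 14, 18, 20, 16, 17),
  stringsAsFactors = FALSE
)
mean_monthly <- c(A = 12, B = 14, C = 16)

# ASSUMPTION: a hypothetical second set of monthly means to disaggregate
new_monthly <- c(A = 10, B = 15, C = 20)

# Look up each row's monthly mean by location name (no join needed)
daily$mean_monthly <- unname(mean_monthly[daily$location])

# Relative deviation of each day from its location's monthly mean
daily$delta_temp <- (daily$daily_temp - daily$mean_monthly) / daily$mean_monthly

# Apply the same relative deviation to the new monthly means
daily$new_daily <- unname(new_monthly[daily$location]) * (1 + daily$delta_temp)

head(daily, 5)
```

Note that the disaggregated values average back to the new monthly mean only if the original daily values average to the specified monthly mean, which is not the case for this five-day sample.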
Answer 1 (score: 1)
Building on @Wimpel's answer, here are a few ways to summarize the differences by location.
df.combo <-
df.daily%>%
left_join( df.monthly, by = c( "location" = "location_monthly" ) ) %>%
mutate( daily_temp = as.numeric( daily_temp ) ) %>%
mutate( mean_monthly = as.numeric( mean_monthly ) ) %>%
mutate( delta_temp = ( daily_temp - mean_monthly ) / mean_monthly ) %>%
# Here I add the difference in degrees between daily temp and monthly avg temp
mutate( temp_dif = daily_temp - mean_monthly)
# For each location, what are some stats about those temp_dif values?
df.loc.stats <-
df.combo %>%
group_by(location) %>%
summarize(mean_dif = mean(temp_dif),
mean_abs_dif = mean(abs(temp_dif)),
SD_dif = sd(temp_dif))
df.loc.stats
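The table shows that location B had the most variable temperatures (measured by either mean absolute difference or standard deviation), while relative to its monthly average A ran the coldest and C the warmest: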
df.loc.stats
# A tibble: 3 x 4
  location mean_dif mean_abs_dif SD_dif
  <chr>       <dbl>        <dbl>  <dbl>
1 A            -1.8          1.8   1.79
2 B            -1.2          2.8   3.70
3 C             1            1.8   2.24
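Since the question asks about percent differences, the same per-location summary could also be expressed in relative terms. The following is a base R sketch under that assumption; the names `combo`, `delta_pct`, and `pct_stats` are made up for this example and do not appear in the answers above.

```r
# Rebuild the combined table in base R (values taken from the question)
combo <- data.frame(
  location   = rep(c("A", "B", "C"), each = 5),
  daily_temp = c(10, 12, 12, 9, 8,
                 13, 14, 18, 8, 11,
                 14, 18, 20, 16, 17),
  stringsAsFactors = FALSE
)
mean_monthly <- c(A = 12, B = 14, C = 16)
combo$mean_monthly <- unname(mean_monthly[combo$location])

# Percent difference between each daily value and the monthly mean
combo$delta_pct <- 100 * (combo$daily_temp - combo$mean_monthly) / combo$mean_monthly

# Mean and mean absolute percent difference per location
mean_pct     <- tapply(combo$delta_pct, combo$location, mean)
mean_abs_pct <- tapply(combo$delta_pct, combo$location, function(x) mean(abs(x)))

pct_stats <- data.frame(
  location     = names(mean_pct),
  mean_pct     = as.numeric(mean_pct),
  mean_abs_pct = as.numeric(mean_abs_pct),
  stringsAsFactors = FALSE
)
pct_stats
```

Because the monthly mean is constant within a location, these figures are the degree-based statistics divided by that mean, so the ranking of locations by variability may differ from the one in degrees.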