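Dear Stack Overflow community,
I am new to the world of statistical programming with R. I have been given the task of creating a simple autoregressive model with which I can predict a country's unemployment rate trend, or rather, predict it using only Google data. To build the model, I was given a .csv file containing the unemployment rates from 2011 to 2015 (5 years) and a .csv file containing the Google Trends values for the topic "Unemployment" (2011-2015).
As you can imagine, I have imported both files into RStudio and converted them to time series (60 months). Here is an overview: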
[Figure: Unemployment Rates vs Google Trends]
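I now need help creating the AR model. Please keep in mind that this model should be as simple as possible and does not need to be perfect. Here are my questions:
Since I am not very experienced with R, I am a bit lost. Any help is much appreciated!
Thanks a lot!
Here is the data (samples are provided in the code below).
This is my code so far: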
# Import required libraries
library(lubridate)
library(tseries)
library(xts)
library(forecast)
library(readr)
# # # # # # # # # # # Unemployment Rate # # # # # # # # # # #
unemploymentRate <- read_csv("~/Desktop/UnemploymentRates_2011-2015.csv")
# Unemployment sample: structure(list(`Month` = 1:10, Year = c(2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L), UnemploymentRate = c(7.9, 7.9, 7.6, 7.3, 7, 6.9, 7, 7, 6.6, 6.5)), .Names = c("Month", "Year", "UnemploymentRate"), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
# Create monthly time series for unemployment rates
tsUnemployment <- ts(unemploymentRate$UnemploymentRate, start = c(2011,1), frequency = 12)
# # # # # # # # # # # Google Trends Topic # # # # # # # # # # #
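google <- read_csv("~/Desktop/google.csv", col_types = cols(Woche = col_date(format="%Y-%m-%d")))
# Rename both columns so the date column matches the "Week" name used below
# (assumes the file has exactly two columns: the date column "Woche" and the trend value)
colnames(google) <- c("Week", "googleTrend")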
#Google sample: structure(list(Week = structure(c(14976, 14983, 14990, 14997, 15004, 15011, 15018, 15025, 15032, 15039), class = "Date"), Unemployment = c(88L, 89L, 100L, 91L, 88L, 88L, 87L, 91L, 89L, 78L)), .Names = c("Week", "Unemployment"), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
# Extract month and year from date
google$Month <- month(google$Week, abbr = FALSE)
google$Year <- year(google$Week)
# Aggregate weeks into months using the mean
# Use the column name set above ("googleTrend"); "google$googleTrends" does not exist
aggGoogle <- aggregate(googleTrend ~ Month + Year, data = google, FUN = mean)
colnames(aggGoogle)[3] <- "aggGoogleTrends"
# Create monthly time series for the Google Trends
tsGoogle <- ts(aggGoogle$aggGoogleTrends, start = c(2011,1), frequency = 12)
# # # # # # # # # # # Decomposition + Analysis # # # # # # # # # # #
decompose_Unemployment <- decompose(tsUnemployment, "additive")
decompose_Google <- decompose(tsGoogle, "additive")
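# Note: seasonal + trend + random rebuilds the original series, except that the
# first and last 6 months are NA (the moving-average trend is undefined there)
finalUnemployment <- decompose_Unemployment$seasonal + decompose_Unemployment$trend + decompose_Unemployment$random
finalGoogle <- decompose_Google$seasonal + decompose_Google$trend + decompose_Google$random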
Now I am ready to run the statistical tests:
adf.test(tsUnemployment, alternative = "stationary")
Box.test(tsUnemployment, type = "Ljung-Box")
Box.test(finalUnemployment, type = "Ljung-Box")
adf.test(tsGoogle, alternative = "stationary")
Box.test(tsGoogle, type = "Ljung-Box")
Box.test(finalGoogle, type = "Ljung-Box")
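Answer 0 (score: 0)
(As @eipi10 commented, this is more of a question for Cross Validated, Data Science, or Mathematics, especially since your difficulty seems to lie with the statistical tests rather than with the code. If the answers you get here do not help, you should consider asking there.)
Suggestion for question 1: This question is particularly hard to answer because it depends on your data. Based on this page, applying a decomposition model is an appropriate thing to do if you decide to use AR. However, that does not mean decomposition is your only option.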
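To make the decomposition route concrete, here is a minimal sketch of decomposing a monthly series and fitting an AR model to what is left over. It is only an illustration under assumptions: AirPassengers (a monthly series shipped with base R) stands in for your data, and the names dec, remainder and fitRemainder are made up for this example.

```r
# Stand-in data (assumption): AirPassengers, a monthly ts shipped with base R
x <- AirPassengers

# Additive decomposition into trend, seasonal and random parts
dec <- decompose(x, type = "additive")

# The random part is NA for the first and last 6 months, because the
# centred moving average behind the trend is undefined there
remainder <- na.omit(dec$random)

# Fit an AR model to the remainder; the order is chosen by AIC (the default)
fitRemainder <- ar(remainder, order.max = 12)
print(fitRemainder$order)
```

Dropping the NA values first matters because ar() fails on missing values by default. Whether the remainder of your own series is well described by an AR model is something you would still need to check.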
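Suggestion for question 2: The simplest way to implement an autoregressive (AR) model in R is with the stats package. As long as you have a time series dataset, the function stats::ar should work for you. If your data is in a data.frame rather than a time series (ts), you can convert it with the stats::ts function.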
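As a rough sketch of those two steps, the following converts a data.frame column to a ts and fits an AR model with stats::ar. It uses only the ten sample rows from your question, so the numbers mean little. Fixing the order at 1 is my assumption to keep the model simple, and the names unemploymentSample, tsSample, fitSample and fcSample are made up for this example.

```r
# Sample rows copied from the question (January to October 2011)
unemploymentSample <- data.frame(
  Month = 1:10,
  Year = rep(2011L, 10),
  UnemploymentRate = c(7.9, 7.9, 7.6, 7.3, 7, 6.9, 7, 7, 6.6, 6.5)
)

# Convert the data.frame column to a monthly time series
tsSample <- ts(unemploymentSample$UnemploymentRate,
               start = c(2011, 1), frequency = 12)

# Fit an AR(1) model; aic = FALSE forces the order to order.max
# instead of letting AIC choose it (assumption, for simplicity)
fitSample <- ar(tsSample, order.max = 1, aic = FALSE)

# Forecast the next 3 months from the end of the series
fcSample <- predict(fitSample, newdata = tsSample, n.ahead = 3)
print(fcSample$pred)
```

With your full 60-month series you could drop aic = FALSE and let ar() pick the order by AIC. Note that stats::ar is univariate, so it models the unemployment series from its own past values only.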