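The time series data contains the following columns:
Product (categorical); ProductGroup (categorical); Country (categorical); YearSinceProductLaunch (numeric); SalesAtLaunchYear (numeric)
Only SalesAtLaunchYear has missing values, and these need to be imputed.
For some products the data is complete, i.e. sales figures exist for year 1, year 2 and so on up to the present.
However, some other products are missing the sales data for the first few years since launch. The products have different ages, so sometimes 2 years since launch are missing and sometimes 10, depending on the product.
I am looking for a model in R that can impute these gaps in the time series. I tried MICE with the imputation method for SalesAtLaunchYear set to random forest, but I still get very high sales values, especially in the first years after launch. I made sure that sales are 0 for every product in year 0 to avoid negative values. The data frame has 20,000 rows covering 300 unique products.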
testdf = tibble::tribble(
~Country, ~ProductGroup, ~Product, ~YearSinceProductLaunch, ~SalesAtLaunchYear,
"CA", "ProductGroup1", "Product1", 0L, 0,
"CA", "ProductGroup1", "Product1", 1L, NA,
"CA", "ProductGroup1", "Product1", 2L, NA,
"CA", "ProductGroup1", "Product1", 3L, NA,
"CA", "ProductGroup1", "Product1", 4L, NA,
"CA", "ProductGroup1", "Product1", 5L, 206034.9814,
"CA", "ProductGroup1", "Product1", 6L, 170143.2623,
"CA", "ProductGroup1", "Product1", 7L, 212541.9306,
"CA", "ProductGroup1", "Product1", 8L, 270663.199,
"CA", "ProductGroup1", "Product1", 9L, 736738.3755,
"CA", "ProductGroup1", "Product1", 10L, 2579723.981,
"CA", "ProductGroup1", "Product1", 11L, 4964319.496,
"CA", "ProductGroup1", "Product1", 12L, 6864985.16,
"CA", "ProductGroup1", "Product1", 13L, 8793292.386,
"CA", "ProductGroup1", "Product1", 14L, 11416033.38,
"IT", "ProductGroup2", "Product2", 0L, 0,
"IT", "ProductGroup2", "Product2", 1L, NA,
"IT", "ProductGroup2", "Product2", 2L, NA,
"IT", "ProductGroup2", "Product2", 3L, NA,
"IT", "ProductGroup2", "Product2", 4L, NA,
"IT", "ProductGroup2", "Product2", 5L, NA,
"IT", "ProductGroup2", "Product2", 6L, NA,
"IT", "ProductGroup2", "Product2", 7L, NA,
"IT", "ProductGroup2", "Product2", 8L, NA,
"IT", "ProductGroup2", "Product2", 9L, NA,
"IT", "ProductGroup2", "Product2", 10L, NA,
"IT", "ProductGroup2", "Product2", 11L, NA,
"IT", "ProductGroup2", "Product2", 12L, NA,
"IT", "ProductGroup2", "Product2", 13L, 30806222.96,
"IT", "ProductGroup2", "Product2", 14L, 31456272,
"IT", "ProductGroup2", "Product2", 15L, 31853476.78,
"IT", "ProductGroup2", "Product2", 16L, 30379818,
"IT", "ProductGroup2", "Product2", 17L, 29765448.87,
"IT", "ProductGroup2", "Product2", 18L, 31376234,
"IT", "ProductGroup2", "Product2", 19L, 32628514.81,
"IT", "ProductGroup2", "Product2", 20L, 32732196,
"IT", "ProductGroup2", "Product2", 21L, 33503784.25,
"IT", "ProductGroup2", "Product2", 22L, 35163372,
"DE", "ProductGroup3", "Product3", 0L, 0,
"DE", "ProductGroup3", "Product3", 1L, 161884.081,
"DE", "ProductGroup3", "Product3", 2L, 7876925.474,
"DE", "ProductGroup3", "Product3", 3L, 12948209.55,
"DE", "ProductGroup3", "Product3", 4L, 13304401.76
)
testdf$Country = as.factor(testdf$Country)
testdf$ProductGroup = as.factor(testdf$ProductGroup)
testdf$Product = as.factor(testdf$Product)
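To see how large the gap is for each product, the missing values can be counted per product. This is only a minimal base R sketch: `demo` is a reduced stand-in for `testdf` (the first six rows of Product1 plus all of Product3, values copied from the data above), and `missing_years` is an illustrative name.

```r
# Reduced stand-in for testdf: first six rows of Product1 and all of Product3
demo <- data.frame(
  Product = c(rep("Product1", 6), rep("Product3", 5)),
  YearSinceProductLaunch = c(0:5, 0:4),
  SalesAtLaunchYear = c(0, NA, NA, NA, NA, 206034.9814,
                        0, 161884.081, 7876925.474, 12948209.55, 13304401.76)
)

# Count the missing sales values for each product
missing_years <- tapply(is.na(demo$SalesAtLaunchYear), demo$Product, sum)
print(missing_years)
```

On the full data the same call would use `testdf` in place of `demo`.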
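Answer 0 (score: 0)
Using mice will probably not give you the results you want, because it relies mainly on correlations between variables. What you need here is a method that exploits correlations over time.
For this specific example, my suggestion is to split the dataset by Country, ProductGroup and Product, and to impute each of the resulting series with a time series imputation package.
Looking at your data, I think something like the na_interpolation function from the imputeTS package (named na.interpolation in older releases) would do a good job.
This is how you call it: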
library("imputeTS")
# called na.interpolation() in older imputeTS releases
na_interpolation(yourTimeSeries)
You have to call it several times, once for each time series obtained from a Country, ProductGroup and Product combination.
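One way to make those repeated calls is a grouped `mutate()`. The following is a sketch under some assumptions, not the only way to do it: it uses dplyr for the grouping, `demo` is a reduced stand-in for `testdf` (values copied from the question), and the column name `SalesImputed` is made up for illustration. Note that interpolation needs at least two observed values in each series.

```r
library(dplyr)
library(imputeTS)

# Reduced stand-in for testdf: first six rows of Product1 and all of Product3
demo <- data.frame(
  Country = c(rep("CA", 6), rep("DE", 5)),
  ProductGroup = c(rep("ProductGroup1", 6), rep("ProductGroup3", 5)),
  Product = c(rep("Product1", 6), rep("Product3", 5)),
  YearSinceProductLaunch = c(0:5, 0:4),
  SalesAtLaunchYear = c(0, NA, NA, NA, NA, 206034.9814,
                        0, 161884.081, 7876925.474, 12948209.55, 13304401.76)
)

imputed <- demo %>%
  # Each series must be in time order before interpolating
  arrange(Country, ProductGroup, Product, YearSinceProductLaunch) %>%
  group_by(Country, ProductGroup, Product) %>%
  # Linear interpolation (the default) applied to one series at a time
  mutate(SalesImputed = na_interpolation(SalesAtLaunchYear)) %>%
  ungroup()

print(imputed)
```

For Product1 the four missing years are filled along a straight line from 0 in year 0 to the year 5 value. Product3 has no gaps, so its values stay as they are.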
You can also run
na_interpolation(testdf$SalesAtLaunchYear)
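on your whole dataset, which is easier. In the example this works as well, because every product starts with an observed 0 in year 0 and ends with an observed value, so no gap spans two products. (It can cause problems if the rest of your data is structured differently, or if you use a different algorithm from the imputeTS package.)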
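Interpolating the whole column at once is only safe when no run of missing values touches the boundary between two products. Otherwise a value from the neighbouring product would be used as an interpolation endpoint. A quick base R check, assuming the rows are sorted by product and year, could look like this. `gaps_stay_inside` is a hypothetical helper written for this sketch, and the data is a reduced copy of the question's values.

```r
# TRUE when every group starts and ends with an observed (non-NA) value,
# so that no NA run can reach into a neighbouring group
gaps_stay_inside <- function(values, group) {
  pieces <- split(values, group, drop = TRUE)
  all(vapply(pieces,
             function(v) !is.na(v[1]) && !is.na(v[length(v)]),
             logical(1)))
}

# Reduced stand-in for testdf: first six rows of Product1 and all of Product3
product <- c(rep("Product1", 6), rep("Product3", 5))
sales <- c(0, NA, NA, NA, NA, 206034.9814,
           0, 161884.081, 7876925.474, 12948209.55, 13304401.76)

ok <- gaps_stay_inside(sales, product)
print(ok)   # TRUE: every gap is enclosed within one product

# Counter-example: drop the year-0 value of Product3
sales_bad <- sales
sales_bad[7] <- NA
bad <- gaps_stay_inside(sales_bad, product)
print(bad)  # FALSE: this gap would be filled using Product1's last value
```

If the check returns FALSE, the per-series approach is the safer choice.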