R

Time: 2018-11-19 17:15:04

Tags: r time-series missing-data imputation r-mice

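The time series data contains:

Product (categorical); ProductGroup (categorical); Country (categorical); YearSinceProductLaunch (numeric); SalesAtLaunchYear (numeric)

Only SalesAtLaunchYear has missing values, and these need to be imputed.

For some products the data is complete, i.e. sales figures exist for year 1, year 2, and so on up to the present.

Other products, however, are missing the sales data for the first years after launch. The products have different ages, so sometimes 2 years are missing after launch and sometimes 10, depending on the product.

I am looking for a model in R that can impute these gaps in the time series. I tried MICE with the method for SalesAtLaunchYear set to random forest, but I still get very high imputed sales values, especially for the years right after product launch. I made sure that sales are 0 in year 0 for every product, to avoid negative values. The data frame has 20,000 rows covering 300 unique products.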

testdf = tibble::tribble(
  ~Country,   ~ProductGroup,   ~Product, ~YearSinceProductLaunch, ~SalesAtLaunchYear,
      "CA", "ProductGroup1", "Product1",                      0L,                  0,
      "CA", "ProductGroup1", "Product1",                      1L,                 NA,
      "CA", "ProductGroup1", "Product1",                      2L,                 NA,
      "CA", "ProductGroup1", "Product1",                      3L,                 NA,
      "CA", "ProductGroup1", "Product1",                      4L,                 NA,
      "CA", "ProductGroup1", "Product1",                      5L,        206034.9814,
      "CA", "ProductGroup1", "Product1",                      6L,        170143.2623,
      "CA", "ProductGroup1", "Product1",                      7L,        212541.9306,
      "CA", "ProductGroup1", "Product1",                      8L,         270663.199,
      "CA", "ProductGroup1", "Product1",                      9L,        736738.3755,
      "CA", "ProductGroup1", "Product1",                     10L,        2579723.981,
      "CA", "ProductGroup1", "Product1",                     11L,        4964319.496,
      "CA", "ProductGroup1", "Product1",                     12L,         6864985.16,
      "CA", "ProductGroup1", "Product1",                     13L,        8793292.386,
      "CA", "ProductGroup1", "Product1",                     14L,        11416033.38,
      "IT", "ProductGroup2", "Product2",                      0L,                  0,
      "IT", "ProductGroup2", "Product2",                      1L,                 NA,
      "IT", "ProductGroup2", "Product2",                      2L,                 NA,
      "IT", "ProductGroup2", "Product2",                      3L,                 NA,
      "IT", "ProductGroup2", "Product2",                      4L,                 NA,
      "IT", "ProductGroup2", "Product2",                      5L,                 NA,
      "IT", "ProductGroup2", "Product2",                      6L,                 NA,
      "IT", "ProductGroup2", "Product2",                      7L,                 NA,
      "IT", "ProductGroup2", "Product2",                      8L,                 NA,
      "IT", "ProductGroup2", "Product2",                      9L,                 NA,
      "IT", "ProductGroup2", "Product2",                     10L,                 NA,
      "IT", "ProductGroup2", "Product2",                     11L,                 NA,
      "IT", "ProductGroup2", "Product2",                     12L,                 NA,
      "IT", "ProductGroup2", "Product2",                     13L,        30806222.96,
      "IT", "ProductGroup2", "Product2",                     14L,           31456272,
      "IT", "ProductGroup2", "Product2",                     15L,        31853476.78,
      "IT", "ProductGroup2", "Product2",                     16L,           30379818,
      "IT", "ProductGroup2", "Product2",                     17L,        29765448.87,
      "IT", "ProductGroup2", "Product2",                     18L,           31376234,
      "IT", "ProductGroup2", "Product2",                     19L,        32628514.81,
      "IT", "ProductGroup2", "Product2",                     20L,           32732196,
      "IT", "ProductGroup2", "Product2",                     21L,        33503784.25,
      "IT", "ProductGroup2", "Product2",                     22L,           35163372,
      "DE", "ProductGroup3", "Product3",                      0L,                  0,
      "DE", "ProductGroup3", "Product3",                      1L,         161884.081,
      "DE", "ProductGroup3", "Product3",                      2L,        7876925.474,
      "DE", "ProductGroup3", "Product3",                      3L,        12948209.55,
      "DE", "ProductGroup3", "Product3",                      4L,        13304401.76
  )


testdf$Country = as.factor(testdf$Country)
testdf$ProductGroup   = as.factor(testdf$ProductGroup)
testdf$Product  = as.factor(testdf$Product)

1 answer:

Answer 0 (score: 0)

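Using mice will probably not give you the results you want, because it relies mainly on correlations between variables. What you are looking for depends much more on correlations in time.

For this particular example, my suggestion is to split the dataset by Country, ProductGroup and Product, and to impute each of the resulting series with a time series imputation package.

Looking at your data, I think something like the na.interpolation function from the imputeTS package would do a good job.

This is how you call it: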

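library("imputeTS")
# yourTimeSeries: a numeric vector or ts object containing NAs.
# In imputeTS 3.0 and later the function is named na_interpolation().
na.interpolation(yourTimeSeries)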

You have to call it several times, once for each time series created from a combination of Country, ProductGroup and Product.
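
The per-series approach can be sketched in base R as follows. This is only a minimal sketch under stated assumptions, not the imputeTS implementation: it swaps in stats::approx() for na.interpolation so that it runs without extra packages, and the toy data frame `sales`, the helper `interpolate_series` and the output column `SalesImputed` are hypothetical names made up for illustration.

```r
# Toy stand-in for the question's data: two products with gaps after year 0.
sales <- data.frame(
  Country                = c(rep("CA", 4), rep("IT", 5)),
  ProductGroup           = c(rep("ProductGroup1", 4), rep("ProductGroup2", 5)),
  Product                = c(rep("Product1", 4), rep("Product2", 5)),
  YearSinceProductLaunch = c(0:3, 0:4),
  SalesAtLaunchYear      = c(0, NA, NA, 300, 0, NA, NA, NA, 1000)
)

# Hypothetical helper: linearly interpolate the NAs of a single series.
interpolate_series <- function(x) {
  observed <- !is.na(x)
  if (sum(observed) < 2) return(x)  # fewer than two points: leave unchanged
  # rule = 2 fills leading/trailing NAs with the nearest observed value.
  approx(which(observed), x[observed], xout = seq_along(x), rule = 2)$y
}

# Rows must be in time order within each series before interpolating.
sales <- sales[order(sales$Country, sales$ProductGroup, sales$Product,
                     sales$YearSinceProductLaunch), ]

# ave() applies the helper separately to each Country/ProductGroup/Product
# combination and puts the results back in the original row order.
sales$SalesImputed <- ave(
  sales$SalesAtLaunchYear,
  sales$Country, sales$ProductGroup, sales$Product,
  FUN = interpolate_series
)

print(sales)
```

With imputeTS installed, the body of the helper could presumably be replaced by a call to the package's interpolation function; the grouping logic stays the same.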

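You could also run

na.interpolation(testdf$SalesAtLaunchYear)

on your whole dataset, which is easier; in the example this would also work. (It could cause problems if the rest of your data is structured differently, or if you use a different algorithm from the imputeTS package.)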
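
Whether the whole-column shortcut is safe can be checked before using it. Linear interpolation over the full column connects each gap to its nearest observed neighbours, so values leak across product boundaries unless every series starts and ends with an observed value (as in the example, where year 0 is always 0 and the latest years are present). The sketch below assumes the rows are sorted by series and by year; `safe_for_whole_column` is a hypothetical helper name, not part of imputeTS.

```r
# Hypothetical check: TRUE if every series begins and ends with a
# non-missing value, so no gap can touch a neighbouring series.
safe_for_whole_column <- function(values, group) {
  starts_and_ends_observed <- tapply(values, group, function(v) {
    !is.na(v[1]) && !is.na(v[length(v)])
  })
  all(starts_and_ends_observed)
}

product <- c("Product1", "Product1", "Product1",
             "Product2", "Product2", "Product2")

# Gaps are strictly inside each series: the shortcut is safe here.
inner_gaps <- c(0, NA, 200, 0, NA, 500)
print(safe_for_whole_column(inner_gaps, product))

# Product1 ends with NA, so its gap would be filled using Product2's year 0.
trailing_gap <- c(0, 100, NA, 0, NA, 500)
print(safe_for_whole_column(trailing_gap, product))
```

If the check fails for your real data, interpolating per Country, ProductGroup and Product is the safer route.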