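I am testing several time series data sets with significant holes (missing values) in R, using various multiple imputation packages. I was able to run Hmisc and MICE successfully. However, I cannot get the missForest method to run, even though it appears to be the simplest of the three.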
Example:
I have a data.frame, df_final, with 2 columns:
day_of_year (1,2,3,....365 -> 365 integer values, no NA)
bookings (279 integer values, 86 NA values)
My goal is to fill in the 86 NA values using missForest.
Here is my code:
library(missForest)
final.imp <- missForest(df_final, verbose = TRUE)
final.imp$OOBerror
final.imp$error
imputed_df <- final.imp$ximp
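How is this possible? Both of my columns have the same length, 365. If the error were caused by the NA values, the algorithm would defeat its own purpose. I must be doing something wrong.
The same code works perfectly with the iris dataset.
Edit: adding the output of dput()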
> dput(df_final)
structure(list(day_of_year = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42,
43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104,
105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208,
209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,
222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234,
235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247,
248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260,
261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273,
274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286,
287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299,
300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312,
313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325,
326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338,
339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351,
352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364,
365), bookings = c(6L, 12L, 17L, 0L, 2L, NA, 19L, 25L, 28L, 47L,
43L, 31L, NA, 10L, 32L, 23L, 55L, 39L, 21L, NA, 10L, 23L, 23L,
56L, 52L, 33L, NA, 19L, 29L, 39L, 69L, 48L, 32L, NA, 21L, 28L,
49L, 63L, 51L, 27L, NA, 18L, 25L, 54L, 64L, 61L, 22L, NA, 11L,
18L, 25L, 13L, 20L, 14L, NA, 31L, 34L, 28L, 47L, 32L, 14L, NA,
16L, 26L, 49L, 46L, 54L, 22L, NA, 26L, 32L, 44L, 64L, 55L, 34L,
NA, 18L, 60L, 52L, 55L, 50L, 20L, NA, 7L, 11L, 23L, 13L, 7L,
NA, NA, 1L, 5L, 16L, 36L, 55L, 19L, NA, 17L, 32L, 52L, 50L, 69L,
21L, NA, 28L, 37L, 57L, 73L, 65L, 36L, NA, 26L, 16L, 41L, 60L,
58L, 63L, NA, 7L, NA, 17L, 36L, 67L, 31L, NA, 20L, 32L, 54L,
60L, 8L, NA, NA, 26L, 31L, 70L, 34L, 2L, 4L, NA, NA, 18L, 17L,
41L, 73L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 31L, 11L, 17L, 26L, 14L,
2L, 14L, 16L, 10L, 15L, 17L, 6L, 7L, 17L, 5L, 5L, 14L, 46L, 11L,
8L, 11L, 12L, 3L, 12L, 19L, 8L, 3L, 10L, 19L, 6L, 9L, 35L, 17L,
9L, 27L, 36L, 11L, 14L, 18L, 10L, 12L, 11L, 18L, 22L, 26L, 14L,
NA, 12L, 20L, 38L, 39L, 39L, 19L, NA, 29L, 25L, 36L, 46L, 55L,
27L, NA, 15L, 20L, 39L, 47L, 58L, 35L, NA, 23L, 26L, 30L, 53L,
78L, 29L, NA, 37L, 28L, 38L, 59L, 73L, 21L, NA, 28L, 23L, 35L,
66L, 54L, 53L, NA, 40L, 15L, 26L, 28L, 29L, 13L, NA, 12L, 30L,
27L, 30L, 31L, 23L, NA, 43L, 27L, 29L, 79L, 62L, 30L, NA, 36L,
25L, 51L, 55L, 55L, 32L, NA, 21L, 20L, 56L, 50L, 60L, 43L, 27L,
NA, 27L, 22L, 39L, 48L, 67L, 25L, NA, 31L, 23L, 56L, 58L, 56L,
22L, NA, 22L, 33L, 51L, 30L, 53L, 15L, NA, 9L, 15L, 41L, 36L,
47L, 14L, NA, 10L, 11L, 38L, 40L, 53L, 12L, NA, 11L, 23L, 26L,
52L, 39L, 18L, NA, 5L, 19L, 24L, 27L, 13L, 10L, NA, NA, NA, 7L,
7L, NA, 3L, NA, NA)), row.names = c(NA, -365L), class = c("tbl_df",
"tbl", "data.frame"))
>
I am not sure why the bookings values are shown as doubles.
Their data type is integer, though:
> typeof(df_final$bookings)
[1] "integer"
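
The checks described in this post (column types, NA counts, object class) can be collected into one short snippet. This is only a sketch: df_check is a hypothetical stand-in built from the first 14 rows of the dput() output above, not the full df_final. The class vector is included because the dput() output lists tbl_df and tbl ahead of data.frame, and a tibble returns a one-column tibble rather than a vector when a single column is selected with [ , ], which may matter to functions written for plain data frames.

```r
# Hypothetical stand-in for df_final: the first 14 rows of the dput() output,
# with the same class vector as the original object.
df_check <- structure(
  list(
    day_of_year = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14),
    bookings = c(6L, 12L, 17L, 0L, 2L, NA, 19L, 25L, 28L, 47L, 43L, 31L, NA, 10L)
  ),
  row.names = c(NA, -14L),
  class = c("tbl_df", "tbl", "data.frame")
)

# Classes of the object, in dispatch order
obj_class <- class(df_check)

# Storage type of each column
col_types <- sapply(df_check, typeof)

# Number of NA values in each column
na_counts <- colSums(is.na(df_check))

print(obj_class)
print(col_types)
print(na_counts)
```

Running the same three checks on the full df_final should confirm the column description given earlier (no NA in day_of_year, 86 NA in bookings).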