Question

Let's say I have a table:

Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  32000
09/13/2017  8000

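The values in the table are estimated by an R program that I don't have access to (it's a black box for now). Some days have 0 values, which tend to creep in because of problems in our ingestion/ETL process. I need to estimate values for the dates with 0 data.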

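Our approach is:

Draw a line from the date right before the missing data to the date right after the missing data
Estimate the values for the missing dates from that line

Now, if only a single day between two good days is missing data, a straightforward mean would work. If two or more consecutive days are missing data, the mean doesn't work, so I'm trying to work out a way to estimate values for multiple data points.

Would this approach work in R? I'm a total n00b at R, so I'm not sure whether this is even feasible.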
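
The line-drawing approach described above is ordinary linear interpolation. As a minimal sketch of the arithmetic (the helper name `fill_gap` is hypothetical and not part of any package): for `k` missing days between a last good value `a` and a next good value `b`, the i-th missing day gets `a + (b - a) * i / (k + 1)`.

```r
## Hypothetical helper: values for k consecutive missing days, read off the
## straight line between the last good value `a` and the next good value `b`
fill_gap <- function(a, b, k) {
  a + (b - a) * seq_len(k) / (k + 1)
}

one_day  <- fill_gap(12000, 11000, 1)  # gap on 09/03: reduces to the plain mean
two_days <- fill_gap(14400, 21000, 2)  # gap on 09/06 and 09/07
print(one_day)
print(two_days)
```

With a single missing day this reduces to the mean of the two neighbours, which is why the mean only works in that case.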

Answer 1

You can fill in the values by linear interpolation using the function approxfun.

## Your data
df = read.table(text="Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  32000
09/13/2017  8000",
header=TRUE, stringsAsFactors=FALSE)
df$Date = as.Date(df$Date, format="%m/%d/%Y")


## Create function for linear interpolation
Interp = approxfun(df[df$Sales > 0, ])

## Use function to fill in interpolated values
Vals = Interp(df$Date[df$Sales == 0])
df$Sales[df$Sales == 0] = Vals
plot(df, type="l")
grid()
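
For comparison, here is a sketch of the same idea using base R's `approx()` directly instead of building a function with `approxfun`. It rebuilds the data so that it runs on its own (the names `df2`, `good` and `filled` are illustrative), and it assumes that a 0 always means "missing" and never a genuine zero-sales day.

```r
## Same data as above, rebuilt so this block is self-contained
sales <- c(9000, 12000, 0, 11000, 14400, 0, 0, 21000,
           15000, 23100, 0, 32000, 8000)
dates <- seq(as.Date("2017-09-01"), by = "day", length.out = length(sales))
df2 <- data.frame(Date = dates, Sales = sales)

## Assumption: zeros mark missing data
good <- df2$Sales > 0

## approx() interpolates linearly between the known (date, sales) points;
## at the known dates it simply returns the known values
filled <- approx(x = as.numeric(df2$Date[good]),
                 y = df2$Sales[good],
                 xout = as.numeric(df2$Date))$y
df2$Sales <- filled
print(df2)
```

Note that `approx()` returns NA outside the range of the known points by default, so a zero on the first or last day would need separate handling.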

Answer 2

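We can also use the na.interpolation function from the imputeTS package (renamed na_interpolation in imputeTS 3.0 and later). Its default method is linear interpolation, but we can specify other methods if needed.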

library(dplyr)
library(imputeTS)

dt2 <- dt %>%
  replace(. == 0, NA) %>%
  mutate(Sales = na.interpolation(Sales))

dt2
         Date Sales
1  09/01/2017  9000
2  09/02/2017 12000
3  09/03/2017 11500
4  09/04/2017 11000
5  09/05/2017 14400
6  09/06/2017 16600
7  09/07/2017 18800
8  09/08/2017 21000
9  09/09/2017 15000
10 09/10/2017 23100
11 09/11/2017 27550
12 09/12/2017 32000
13 09/13/2017  8000

DATA

dt <- read.table(text = "Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  32000
09/13/2017  8000",
                 header = TRUE, stringsAsFactors = FALSE)

R - Estimating missing values
