Question

I have a dataset containing several variables and the number of items sold. However, some days have no values.

I created a second dataset in which all sales are 0 and all other variables are NA. How can I add these rows to the initial dataset?

At the moment, I have this:

sales
day    month    year    employees    holiday    sales
1      1        2018    14           0          1058
2      1        2018    25           1          2174 
4      1        2018    11           0          987

sales.NA
day    month    year    employees    holiday    sales
1      1        2018    NA           NA         0
2      1        2018    NA           NA         0
3      1        2018    NA           NA         0
4      1        2018    NA           NA         0

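I want to create a new dataset that inserts the days for which I have no observations, with sales set to 0 and NA for all other variables. Like this: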

new.data
day    month    year    employees    holiday    sales
1      1        2018    14           0          1058
2      1        2018    25           1          2174 
3      1        2018    NA           NA         0
4      1        2018    11           0          987

I tried something like this:

merge(sales.NA,sales, all.y=T, by = c("day","month","year"))

But this does not work.

Answer 1

With dplyr, you can use right_join. For example:

library(dplyr)

sales <- data.frame(day = c(1,2,4), 
                    month = c(1,1,1),
                    year = c(2018, 2018, 2018),
                    employees = c(14, 25, 11), 
                    holiday = c(0,1,0), 
                    sales = c(1058, 2174, 987)
                    )

sales.NA <- data.frame(day = c(1,2,3,4),
                       month = c(1,1,1,1), 
                       year = c(2018,2018,2018, 2018)
                       )

right_join(sales, sales.NA)

This gives you:

  day month year employees holiday sales
1   1     1 2018        14       0  1058
2   2     1 2018        25       1  2174
3   3     1 2018        NA      NA    NA
4   4     1 2018        11       0   987

This leaves NA in sales where you want 0, but that can be fixed either by including the sales column in sales.NA or by using replace_na from tidyr:

library(tidyr)

right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
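
The steps above can be put together as one self-contained sketch. It makes a few assumptions that are not in the original answer: the helper table is named all_days (a hypothetical name standing in for sales.NA), the join keys are spelled out with by, and left_join is used from the full list of days instead of right_join. The result is sorted explicitly because, depending on the dplyr version, right_join may place the unmatched row at the end rather than in day order.

```r
library(dplyr)
library(tidyr)

sales <- data.frame(day = c(1, 2, 4),
                    month = c(1, 1, 1),
                    year = c(2018, 2018, 2018),
                    employees = c(14, 25, 11),
                    holiday = c(0, 1, 0),
                    sales = c(1058, 2174, 987))

# Hypothetical helper table: every day that should appear in the result.
all_days <- data.frame(day = c(1, 2, 3, 4),
                       month = c(1, 1, 1, 1),
                       year = c(2018, 2018, 2018, 2018))

new_data <- all_days %>%
  # Keep every row of all_days; days missing from sales get NA.
  left_join(sales, by = c("day", "month", "year")) %>%
  # Replace the NA in the sales column (and only there) with 0.
  mutate(sales = replace_na(sales, 0)) %>%
  # Sort explicitly instead of relying on the join's row order.
  arrange(year, month, day)

print(new_data)
```

Starting from the complete list of days keeps the intent visible: every row of all_days survives, and only the sales column is filled with 0.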

Answer 2

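Here is an answer using the data.table package, since I am more familiar with its syntax, but regular data.frames work almost the same way. I will also switch to a proper date format, which will make your life much easier. In fact, this way you will not need the sales.NA table at all, because after the first join every day without data shows up automatically with NA.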

library(data.table)


dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day"  ))
dt.sales <- data.table(day = c(1,2,4)
                       , month = c(1,1,1)
                       , year = c(2018,2018,2018)
                       , employees = c(14, 25, 11)
                       , holiday = c(0,1,0)
                       , sales = c(1058, 2174, 987)
                       )


dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]

merge( x = dt.dates
       , y = dt.sales
       , by.x = "Date"
       , by.y = "Date"
       , all.x = TRUE
)
             Date day month year employees holiday sales
    1: 2018-01-01   1     1 2018        14       0  1058
    2: 2018-01-02   2     1 2018        25       1  2174
    3: 2018-01-03  NA    NA   NA        NA      NA    NA
    4: 2018-01-04   4     1 2018        11       0   987
...
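
Since the answer notes that regular data.frames work almost the same way, here is a rough base R sketch of the same idea. It assumes a date range of only four days to keep the output short, and it adds two steps that the answer does not show: filling sales with 0 and rebuilding day, month and year from Date for the inserted rows. The names dates, new_data and missing are illustrative.

```r
sales <- data.frame(day = c(1, 2, 4),
                    month = c(1, 1, 1),
                    year = c(2018, 2018, 2018),
                    employees = c(14, 25, 11),
                    holiday = c(0, 1, 0),
                    sales = c(1058, 2174, 987))

# Build a proper Date column from the three separate columns.
sales$Date <- as.Date(paste(sales$year, sales$month, sales$day, sep = "-"))

# Assumed range: only the first four days of January 2018.
dates <- data.frame(Date = seq(as.Date("2018-01-01"),
                               as.Date("2018-01-04"),
                               by = "day"))

# all.x = TRUE keeps every date, including those without sales.
new_data <- merge(dates, sales, by = "Date", all.x = TRUE)

# Fill the inserted rows: sales becomes 0, the date parts come from Date.
missing <- is.na(new_data$sales)
new_data$sales[missing] <- 0
new_data$day[missing]   <- as.numeric(format(new_data$Date[missing], "%d"))
new_data$month[missing] <- as.numeric(format(new_data$Date[missing], "%m"))
new_data$year[missing]  <- as.numeric(format(new_data$Date[missing], "%Y"))

print(new_data)
```

This also shows why the merge in the question returned only three rows: all.y = TRUE keeps every row of the second table, which there was sales, whereas the complete list of days has to be the side that is kept.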

Answer 3

Here is another data.table solution:

library(data.table)

jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]

   day month year employees holiday sales
1:   1     1 2018        14       0  1058
2:   2     1 2018        25       1  2174
3:   3     1 2018        NA      NA     0
4:   4     1 2018        11       0   987

Or, with somewhat more concise syntax:

sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
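
To make the concise form easier to follow, here is a small self-contained sketch of it, split into two steps. It assumes the tables are built with data.table() rather than structure(), and that sales.NA holds only the three key columns, so no column selection is needed. The name new.data is taken from the question.

```r
library(data.table)

sales <- data.table(day = c(1L, 2L, 4L),
                    month = c(1L, 1L, 1L),
                    year = c(2018L, 2018L, 2018L),
                    employees = c(14L, 25L, 11L),
                    holiday = c(0L, 1L, 0L),
                    sales = c(1058L, 2174L, 987L))

# Assumption: sales.NA contains only the key columns.
sales.NA <- data.table(day = 1:4,
                       month = c(1L, 1L, 1L, 1L),
                       year = c(2018L, 2018L, 2018L, 2018L))

jvars <- c("day", "month", "year")

# X[Y, on = ...] returns one row per row of Y (here sales.NA),
# so days that are missing from sales come back with NA.
new.data <- sales[sales.NA, on = jvars]

# Update by reference: set sales to 0 only where it is NA.
new.data[is.na(sales), sales := 0L]

print(new.data)
```

Inside the square brackets, sales refers to the column of new.data rather than to the table of the same name, which is why is.na(sales) works here.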

Reproducible data:

sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L, 
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L, 
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L, 
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA, 
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))

Problem joining datasets in R
