Problem joining datasets in R

Date: 2019-01-17 10:12:35

Tags: r dataframe merge

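I have a dataset containing several variables and the number of items sold. However, some days have no values.

I created a second dataset in which sales is 0 for every day and all other variables are NA. How can I add the missing rows to the initial dataset?

At the moment, I have this: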

sales
day    month    year    employees    holiday    sales
1      1        2018    14           0          1058
2      1        2018    25           1          2174 
4      1        2018    11           0          987

sales.NA
day    month    year    employees    holiday    sales
1      1        2018    NA           NA         0
2      1        2018    NA           NA         0
3      1        2018    NA           NA         0
4      1        2018    NA           NA         0

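I want to create a new dataset that inserts the days for which I have no observations, with sales set to 0 and NA for all other variables. Like this: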

new.data
day    month    year    employees    holiday    sales
1      1        2018    14           0          1058
2      1        2018    25           1          2174 
3      1        2018    NA           NA         0
4      1        2018    11           0          987

I tried something like this:

merge(sales.NA,sales, all.y=T, by = c("day","month","year"))

But this does not work.

3 answers:

Answer 0 (score: 1)

With dplyr, you can use "right_join". For example:

library(dplyr)  # provides right_join(), mutate() and %>%

sales <- data.frame(day = c(1,2,4), 
                    month = c(1,1,1),
                    year = c(2018, 2018, 2018),
                    employees = c(14, 25, 11), 
                    holiday = c(0,1,0), 
                    sales = c(1058, 2174, 987)
                    )

sales.NA <- data.frame(day = c(1,2,3,4),
                       month = c(1,1,1,1), 
                       year = c(2018,2018,2018, 2018)
                       )

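# Name the join keys explicitly instead of relying on the "Joining, by = ..." default
right_join(sales, sales.NA, by = c("day", "month", "year"))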

This gives you:

  day month year employees holiday sales
1   1     1 2018        14       0  1058
2   2     1 2018        25       1  2174
3   3     1 2018        NA      NA    NA
4   4     1 2018        11       0   987

This leaves NA in sales where you want 0. You can fix that either by including the sales column in sales.NA, or by using "tidyr":

library(tidyr)  # provides replace_na()

right_join(sales, sales.NA, by = c("day", "month", "year")) %>% mutate(sales = replace_na(sales, 0))
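For readers who would rather stay with the base R `merge()` call from the question, here is a minimal sketch. The original attempt fails for two reasons: `all.y = TRUE` keeps all rows of `sales` (the second argument), so day 3 is dropped, and the non-key columns that exist in both tables come back duplicated with `.x`/`.y` suffixes. The sketch keeps only the key columns of `sales.NA` and uses `all.x = TRUE`. The name `keys` is introduced here for illustration only and does not appear in the question.

```r
# Toy data matching the question (base R only, no packages needed).
sales <- data.frame(day = c(1, 2, 4),
                    month = c(1, 1, 1),
                    year = c(2018, 2018, 2018),
                    employees = c(14, 25, 11),
                    holiday = c(0, 1, 0),
                    sales = c(1058, 2174, 987))

sales.NA <- data.frame(day = c(1, 2, 3, 4),
                       month = c(1, 1, 1, 1),
                       year = c(2018, 2018, 2018, 2018),
                       employees = NA,
                       holiday = NA,
                       sales = 0)

# Hypothetical helper name: the columns that identify a calendar day.
keys <- c("day", "month", "year")

# Use only the key columns of sales.NA so no .x/.y duplicates appear,
# and all.x = TRUE so every day listed in sales.NA is retained.
new.data <- merge(sales.NA[, keys], sales, by = keys, all.x = TRUE)

# Days without an observation come back with NA sales; set those to 0.
new.data$sales[is.na(new.data$sales)] <- 0

# merge() sorts by the key columns in the order given in `by` (day first),
# so reorder explicitly into calendar order.
new.data <- new.data[order(new.data$year, new.data$month, new.data$day), ]

print(new.data)
```

The explicit `order()` call makes no difference for this four-row example, but it matters once the data spans more than one month, because sorting by `day` first would interleave the months.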

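Answer 1 (score: 0)

Here is an answer using the data.table package, because I am more familiar with its syntax, but regular data.frames work almost the same way. I also switch to a proper date format, which will make your life easier. In fact, this way you do not need the sales.NA table at all, because after the first join every day without data automatically gets NA.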

library(data.table)


dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day"  ))
dt.sales <- data.table(day = c(1,2,4)
                       , month = c(1,1,1)
                       , year = c(2018,2018,2018)
                       , employees = c(14, 25, 11)
                       , holiday = c(0,1,0)
                       , sales = c(1058, 2174, 987)
                       )


dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]

merge( x = dt.dates
       , y = dt.sales
       , by.x = "Date"
       , by.y = "Date"
       , all.x = TRUE
)
             Date day month year employees holiday sales
    1: 2018-01-01   1     1 2018        14       0  1058
    2: 2018-01-02   2     1 2018        25       1  2174
    3: 2018-01-03  NA    NA   NA        NA      NA    NA
    4: 2018-01-04   4     1 2018        11       0   987
...
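The join above still leaves NA in sales, day, month and year for the days that were filled in. One possible way to finish the job is sketched below, with the date range shortened to four days so the result is easy to inspect. The name `dt.full` is an assumption made for this illustration and is not part of the answer.

```r
library(data.table)

# Calendar of all days of interest (shortened to 4 days for illustration).
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"),
                                       to = as.Date("2018-01-04"),
                                       by = "day"))

dt.sales <- data.table(day = c(1, 2, 4),
                       month = c(1, 1, 1),
                       year = c(2018, 2018, 2018),
                       employees = c(14, 25, 11),
                       holiday = c(0, 1, 0),
                       sales = c(1058, 2174, 987))
dt.sales[, Date := as.Date(paste(year, month, day, sep = "-"))]

# Left join: keep every calendar day, matched or not.
# dt.full is a hypothetical name for the joined table.
dt.full <- merge(dt.dates, dt.sales, by = "Date", all.x = TRUE)

# Days with no observation: sales becomes 0, the other measures stay NA.
dt.full[is.na(sales), sales := 0]

# Rebuild the calendar columns from Date so the filled-in rows are complete.
dt.full[, `:=`(day   = as.integer(format(Date, "%d")),
               month = as.integer(format(Date, "%m")),
               year  = as.integer(format(Date, "%Y")))]

print(dt.full)
```

Deriving day, month and year from `Date` keeps a single source of truth for the calendar, which is the main benefit of switching to a proper date column.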

Answer 2 (score: 0)

Here is another data.table solution:

jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]

   day month year employees holiday sales
1:   1     1 2018        14       0  1058
2:   2     1 2018        25       1  2174
3:   3     1 2018        NA      NA     0
4:   4     1 2018        11       0   987

Or, with somewhat more concise syntax:

sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]

Reproducible data:

sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L, 
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L, 
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L, 
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA, 
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))