melt.data.table with na.rm applied to only part of measure.vars

Time: 2018-05-09 13:02:16

Tags: r list data.table melt na.rm

I would like to explore the best way to melt a data.table with na.rm where it applies only to the first element of the measure.vars list.

I have a data.table as follows:

library(data.table)
library(lubridate)
dt.master <- data.table(user = seq(1,5),
                        visit_id = c(2,4,NA,4,8),
                        visit_date = c(dmy("10/02/2018"), dmy("11/04/2018"), NA, dmy("02/03/2018"), NA),
                        offer_id = c(1,3,NA,NA,NA),
                        offer_date = c(dmy("15/02/2018"), dmy("18/04/2018"), NA, NA, NA))

dt.master

   user visit_id visit_date offer_id offer_date
1:    1        2 2018-02-10        1 2018-02-15
2:    2        4 2018-04-11        3 2018-04-18
3:    3       NA       <NA>       NA       <NA>
4:    4        4 2018-03-02       NA       <NA>
5:    5        8       <NA>       NA       <NA>

I would like to have the "story" of commercial activity for each user (i.e. their visits and their offers):

dt.melted <- melt(dt.master,
                  id.vars = "user",
                  measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
                  variable.name = "level",
                  value.name = c("level_id", "level_date"))

dt.melted

    user level level_id level_date
 1:    1     1        2 2018-02-10
 2:    2     1        4 2018-04-11
 3:    3     1       NA       <NA>
 4:    4     1        4 2018-03-02
 5:    5     1        8       <NA>
 6:    1     2        1 2018-02-15
 7:    2     2        3 2018-04-18
 8:    3     2       NA       <NA>
 9:    4     2       NA       <NA>
10:    5     2       NA       <NA>

However, I don't want NA to be present in the level_id column, i.e.:

   user level level_id level_date
1:    1     1        2 2018-02-10
2:    2     1        4 2018-04-11
3:    4     1        4 2018-03-02
4:    5     1        8       <NA>
5:    1     2        1 2018-02-15
6:    2     2        3 2018-04-18

Unfortunately, the data quality of the sample is very poor, so level_date is not always available. Therefore, na.rm = TRUE doesn't work, and I would get:

dt.melted.na <- melt(dt.master,
                     id.vars = "user",
                     measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
                     variable.name = "level",
                     value.name = c("level_id", "level_date"),
                     na.rm = TRUE)

dt.melted.na

   user level level_id level_date
1:    1     1        2 2018-02-10
2:    2     1        4 2018-04-11
3:    4     1        4 2018-03-02
4:    1     2        1 2018-02-15
5:    2     2        3 2018-04-18

Is there a way to apply na.rm = TRUE only to the first element of the list in measure.vars? I'm exploring other workarounds (e.g. populating visit_date and offer_date with "fake" dates when visit_id and offer_id are available), but I wonder if there is an elegant solution.

1 Answer:

Answer 0 (score: 1)

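An elegant solution would be for the na.rm parameter of melt() to accept a vector of Boolean values, one for each element of the measure.vars list, e.g.,

melt(dt.master,
     id.vars = "user",
     measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
     variable.name = "level",
     value.name = c("level_id", "level_date"),
     na.rm = c(TRUE, FALSE)) # not possible with data.table v1.11.0

As this feature is not implemented yet, an alternative approach is to add the missing rows to dt.melted.na after it has been reshaped to long format. The OP has pointed out that na.rm = TRUE has to be used because of the size of the problem and memory limitations.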
rbind(
  dt.melted.na,
  dt.master[!is.na(visit_id) & is.na(visit_date), .(user, level = 1L, level_id = visit_id)],
  dt.master[!is.na(offer_id) & is.na(offer_date), .(user, level = 2L, level_id = offer_id)],
  fill = TRUE
)

   user level level_id level_date
1:    1     1        2 2018-02-10
2:    2     1        4 2018-04-11
3:    4     1        4 2018-03-02
4:    1     2        1 2018-02-15
5:    2     2        3 2018-04-18
6:    5     1        8       <NA>

This approach is rather clumsy and verbose, but it may help to overcome the memory limitations. It is essentially a reshape "by hand" for the missing rows.

There is another alternative which might be less verbose:

incomplete_rows <- melt(dt.master[!is.na(visit_id) & is.na(visit_date) | !is.na(offer_id) & is.na(offer_date)],
                        id.vars = "user",
                        measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
                        variable.name = "level",
                        value.name = c("level_id", "level_date"))[!is.na(level_id)]
rbind(
  dt.melted.na,
  incomplete_rows
)

Here, all partially incomplete rows are picked from dt.master, reshaped to long format, and filtered afterwards. If this concerns only a small fraction of the rows of dt.master, it may also work with limited memory.
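For comparison only, and not part of the original answer: if memory were not a constraint, a minimal sketch of the most direct route would be to melt without na.rm and then drop the rows whose level_id is NA. The sample data is rebuilt here with as.Date() instead of lubridate::dmy() so the sketch depends only on data.table; the name dt.filtered is made up for illustration.

```r
library(data.table)

# Same sample data as in the question, built with as.Date() so that
# only data.table is required (an assumption made for this sketch).
dt.master <- data.table(
  user       = 1:5,
  visit_id   = c(2, 4, NA, 4, 8),
  visit_date = as.Date(c("2018-02-10", "2018-04-11", NA, "2018-03-02", NA)),
  offer_id   = c(1, 3, NA, NA, NA),
  offer_date = as.Date(c("2018-02-15", "2018-04-18", NA, NA, NA))
)

# Melt without na.rm, then keep only the rows whose id is present.
# dt.filtered is a hypothetical name used for illustration.
dt.filtered <- melt(
  dt.master,
  id.vars = "user",
  measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
  variable.name = "level",
  value.name = c("level_id", "level_date")
)[!is.na(level_id)]

print(dt.filtered)
```

This keeps the row for user 5 (visit_id present, visit_date missing) while dropping the rows with no id at all. The catch is that the full long table is materialised before filtering, which is exactly what the OP's memory limits reportedly rule out, so the rbind() approaches above remain the practical choice for large data.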