Incorrect translation of group-filter-select by dtplyr

Date: 2019-11-19 21:08:48

Tags: r dplyr dtplyr

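With dplyr, a group-filter-select is easy to do. In the example below, we have data for a few companies across different quarters of this year. I want to keep only the first-quarter rows of companies that have no fourth-quarter data (here, the second company), and drop the quarter column.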

library(dplyr)

df <- data.frame(companyId = c(rep(1, 4),
                               rep(2, 3),
                               rep(3, 4)),
                 Quarter = c(1:4, 1:3, 1:4),
                 Year = 2019)

q <- 4                 

df %>%
  group_by(
    companyId
  ) %>%
  filter(
    Quarter == 1 &
      !(q %in% Quarter)
  ) %>%
  select(companyId,
         Year)

> # A tibble: 1 x 2
> # Groups:   companyId [1]
>   companyId  Year
>       <dbl> <dbl>
> 1         2  2019

However, doing the same with dtplyr returns an empty table:

library(data.table)
library(dtplyr)

dt <- lazy_dt(data.table(companyId = c(rep(1, 4),
                                       rep(2, 3),
                                       rep(3, 4)),
                         Quarter = c(1:4, 1:3, 1:4),
                         Year = 2019))

q <- 4

dt %>%
  group_by(
    companyId
  ) %>%
  filter(
    Quarter == 1 &
      !(q %in% Quarter)
  ) %>%
  select(companyId,
         Year)

> Source: local data table [?? x 2]
> Call:   `_DT1`[Quarter == 1 & !(q %in% Quarter), .(companyId, 
>     Year)]
> 
> # ... with 2 variables: companyId <dbl>, Year <dbl>
> 
> # Use as.data.table()/as.data.frame()/as_tibble() to access results

The translation it shows is odd:

`_DT1`[Quarter == 1 & !(q %in% Quarter),
       .(companyId, Year)]

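This is incorrect: the grouping has been dropped, so the filter runs over the whole table, where q is always present in Quarter. As described in dtplyr's own docs, the correct call needs to use a filtered .SD: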

`_DT1`[, .SD[Quarter == 1 & !(q %in% Quarter)],
       by = .(companyId),
       .SDcols = c("Year")]

(The by columns are added automatically, so .SDcols should omit them to avoid duplicates.)
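
For comparison, the grouped filter can also be written directly in data.table. This is only a minimal sketch, assuming the same example data held in a plain data.table; the name raw_dt is introduced here for illustration. It filters within each group first and selects the columns in a second step, rather than using .SDcols:

```r
library(data.table)

# Same example data as above, as a plain data.table
# (raw_dt is an illustrative name, not from the original post)
raw_dt <- data.table(companyId = c(rep(1, 4), rep(2, 3), rep(3, 4)),
                     Quarter = c(1:4, 1:3, 1:4),
                     Year = 2019)
q <- 4

# Step 1: filter inside each companyId group via .SD
# Step 2: keep only the columns of interest
res <- raw_dt[, .SD[Quarter == 1 & !(q %in% Quarter)], by = .(companyId)][
  , .(companyId, Year)]
res
```

Because the condition is evaluated once per companyId group, `q %in% Quarter` is only TRUE for companies that actually have a fourth-quarter row, which is the behavior the dplyr version relies on.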

Interestingly, if we omit the select, the translation (and therefore the output) is correct:

dt %>%
  group_by(
    companyId
  ) %>%
  filter(
    Quarter == 1 &
      !(q %in% Quarter)
  )

> Source: local data table [?? x 3]
> Call:   `_DT2`[, .SD[Quarter == 1 & !(q %in% Quarter)], 
>     keyby = .(companyId)]
> 
>   companyId Quarter  Year
>       <dbl>   <int> <dbl>
> 1         2       1  2019

So, as a workaround, I can call as.data.table() before the select. This works, but it emits an annoying warning:

dt %>%
  group_by(
    companyId
  ) %>%
  filter(
    Quarter == 1 &
      !(q %in% Quarter)
  ) %>%
  as.data.table() %>%
  select(companyId,
         Year)

>    companyId Year
> 1:         2 2019
> Warning message:
> You are using a dplyr method on a raw data.table, which will call the data frame implementation,
> and is likely to be inefficient.
> * 
> * To suppress this message, either generate a data.table translation with `lazy_dt()` or convert
> * to a data frame or tibble with `as.data.frame()`/`as_tibble()`.
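
The warning text itself suggests a quieter variant of the same workaround: materialize the result with as_tibble() instead of as.data.table(), so that select() runs as the ordinary dplyr method. A sketch under that assumption, with the example data redefined so the snippet stands on its own; the exact data.table call that dtplyr generates for the filter may differ between package versions:

```r
library(data.table)
library(dtplyr)
library(dplyr)

dt <- lazy_dt(data.table(companyId = c(rep(1, 4), rep(2, 3), rep(3, 4)),
                         Quarter = c(1:4, 1:3, 1:4),
                         Year = 2019))
q <- 4

res <- dt %>%
  group_by(companyId) %>%
  filter(Quarter == 1 & !(q %in% Quarter)) %>%
  as_tibble() %>%           # collect the grouped filter result as a tibble
  select(companyId, Year)   # plain dplyr select, no data.table involved
res
```

This leaves the grouped filter to data.table and only pays the dplyr cost on the already-filtered rows, which are usually few.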

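I find it hard to believe that the translation of the grouped filter followed by select is intended behavior, but I wanted to check before filing it on the dtplyr GitHub tracker.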

1 answer:

Answer 0 (score: 0)

This is a bug in dtplyr. I have posted it to the package's GitHub.