NA values in inner_join() on SQLite tibbles

Asked: 2017-09-14 16:07:11

Tags: r dplyr dbplyr

I am having trouble doing natural joins on SQLite tibbles (tbl_sql objects) when the columns they share by name contain NA values (or what I take to be missing values).

library(DBI)
library(dplyr)
library(dbplyr)

## modify mtcars for example
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))

## store modcars in sql table and get it
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "modcars", modcars)
srcdbi_mydb <- src_dbi(mydb)
tbl_modcars <- tbl(srcdbi_mydb, "modcars")

modcars %>% head
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb NAs
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

tbl_modcars %>% head
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1
#> 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1    NA
#> 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2    NA
#> 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1    NA

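Note the difference in the output when each table is inner-joined with itself. It comes from how dplyr and SQLite treat missing values: on local data frames dplyr matches NA with NA by default, whereas in SQLite the comparison NULL = NULL is not true, so rows with NULL in a join column drop out.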
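The difference can be reproduced in isolation. The following is a minimal sketch, not part of the original question: the in-memory connection `con` and the toy data frame `df` are hypothetical stand-ins, and the `na_matches` argument is the one dplyr documents for joins on local data frames.

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# In SQL, comparing NULL with `=` yields NULL (unknown), never TRUE,
# so an equality join condition on two NULLs does not match.
# `IS NULL`, by contrast, is a real test and returns 1.
null_cmp <- dbGetQuery(con, "SELECT NULL = NULL AS eq, NULL IS NULL AS is_null")

# Hypothetical toy data: two of the three rows have NA in `key`.
df <- data.frame(id = c(1, 2, 3), key = c(1, NA, NA))

# Default for local data frames: NA matches NA (na_matches = "na").
n_default <- nrow(inner_join(df, df, by = c("id", "key")))

# na_matches = "never" mimics the database behaviour: NA rows drop out.
n_never <- nrow(inner_join(df, df, by = c("id", "key"), na_matches = "never"))

dbDisconnect(con)
```

Here `null_cmp$eq` comes back as NA in R, and `n_default` is larger than `n_never`, which mirrors the six-row versus three-row outputs in this post.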

inner_join(modcars, modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#>    mpg cyl disp  hp drat    wt  qsec vs am gear carb NAs
#> 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   1
#> 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   1
#> 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   1
#> 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1  NA
#> 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  NA
#> 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1  NA

inner_join(tbl_modcars, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1

Effectively, I would like calling inner_join() on tbl_modcars to return the same rows as calling inner_join() on the modcars data.frames.

I realize I could simply use the following code to get the desired output:

joinee1 <- tbl_modcars %>% select(setdiff(colnames(tbl_modcars), "NAs"))
inner_join(joinee1, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
#> # Source:   lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   NAs
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4     1
#> 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4     1
#> 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1     1
#> 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1    NA
#> 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2    NA
#> 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1    NA

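However, this drops the NAs column from the join keys, so any non-NA information in that column is ignored when matching (where applicable). Also, I would prefer to make just one dplyr call rather than two (with too many chained calls, parser stack overflow may become an issue).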
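One possible direction, sketched here under assumptions rather than offered as a confirmed answer, is to do the join in a single SQL statement using SQLite's `IS` operator, which behaves like `=` except that `NULL IS NULL` is true. The table `toy` and its columns are hypothetical stand-ins for modcars, and the query is sent through DBI rather than through dplyr verbs.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table standing in for modcars.
toy <- data.frame(id = c(1, 2, 3), NAs = c(1, NA, NA))
dbWriteTable(con, "toy", toy)

# Build one `a."col" IS b."col"` condition per shared column.
# In SQLite, IS is a NULL-safe equality test, so NULL keys match each other
# while non-NULL keys are still compared by value.
cols <- dbListFields(con, "toy")
on_clause <- paste(sprintf('a."%s" IS b."%s"', cols, cols), collapse = " AND ")
query <- paste("SELECT a.* FROM toy AS a INNER JOIN toy AS b ON", on_clause)

joined <- dbGetQuery(con, query)

dbDisconnect(con)
```

In this sketch all rows of `toy` survive the self-join, including those with NA in the `NAs` column, and the non-NA values in that column still take part in the match. The `IS` operator as NULL-safe equality is SQLite-specific. Later dbplyr releases also document an `na_matches` argument for database joins; whether it is available depends on the installed version, which is newer than the one shown in this post.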

Thanks for any solutions or clarifications.

0 answers:

No answers yet.