I'm having trouble with natural joins on sqlite tibbles (tbl_sql objects) whose identically named columns contain NA values (or what I think of as missing values).
library(DBI)
library(dplyr)
library(dbplyr)
## modify mtcars for example
modcars <- mtcars
modcars[["NAs"]] <- c(rep(1, 3), rep(NA, 29))
## store modcars in sql table and get it
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "modcars", modcars)
srcdbi_mydb <- src_dbi(mydb)
tbl_modcars <- tbl(srcdbi_mydb, "modcars")
modcars %>% head
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
tbl_modcars %>% head
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
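Note the difference in the output of the inner joins on these two tables. This comes from how dplyr and sqlite treat missing values: the data.frame join matches NA to NA, while the sqlite join drops every row that has a missing value in a join column.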
inner_join(modcars, modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
inner_join(tbl_modcars, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "NAs")
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
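To show what I believe is happening underneath, here is a minimal sketch run directly against SQLite. It assumes nothing beyond DBI and RSQLite; the connection name `con` and the result name `null_cmp` are only for illustration. In SQL, `=` with a NULL operand evaluates to NULL rather than true, so a join condition on a column holding NULL never succeeds, whereas SQLite's `IS` operator treats two NULLs as equal. dplyr's local joins, by contrast, match NA to NA by default.

```r
library(DBI)

# Throwaway in-memory database, used only to evaluate two expressions
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# `=` with NULL yields NULL (NA in R); `IS` treats two NULLs as equal (1)
null_cmp <- dbGetQuery(con, "SELECT NULL = NULL AS eq, NULL IS NULL AS is_same")
null_cmp

dbDisconnect(con)
```

The `eq` column should come back as NA and `is_same` as 1, which would explain why only the three rows with a non-missing `NAs` value survive the join on `tbl_modcars`.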
Effectively, I would like the rows shown when calling inner_join() on the modcars data.frames to also be shown when calling inner_join() on the tbl_modcars objects.
I realize that I could simply use the following code to get the desired output:
joinee1 <- tbl_modcars %>% select(setdiff(colnames(tbl_modcars), "NAs"))
inner_join(joinee1, tbl_modcars) %>% head
#> Joining, by = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
#> # Source: lazy query [?? x 12]
#> # Database: sqlite 3.19.3 []
#> mpg cyl disp hp drat wt qsec vs am gear carb NAs
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
#> 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 NA
#> 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 NA
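However, this ignores any joining on the non-NA information in the NAs column (where applicable). I would also prefer to make a single dplyr call rather than two (with too many calls, parser stack overflow can become an issue).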
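One direction that might do this in a single call, sketched here under the assumption that a newer dbplyr is installed: as far as I can tell, later releases (around 2.0.0) let the database joins take an `na_matches` argument, and `na_matches = "na"` asks for NULLs to be matched the way dplyr matches NA locally. I have not checked this against the versions shown above, and the table `toy` and its columns are made up for the illustration.

```r
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table: two of the three rows have a missing `val`
dbWriteTable(con, "toy", data.frame(id = 1:3, val = c(1, NA, NA)))
toy <- tbl(con, "toy")

# Default database behaviour: NULLs never match, so only id = 1 is kept
default_rows <- inner_join(toy, toy, by = c("id", "val")) %>% collect()

# Assumes a dbplyr version that supports na_matches on database joins
na_rows <- inner_join(toy, toy, by = c("id", "val"), na_matches = "na") %>%
  collect()

nrow(default_rows)
nrow(na_rows)

dbDisconnect(con)
```

If the argument is supported, the second join should keep all three rows while still joining on the non-NA values of `val`, which is the behaviour I am after for the NAs column.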
Thanks for any solutions or clarifications.