Question

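I'm trying to use dplyr and R to work with a SQL database, and I'd like to handle SQL NULL values gracefully, either by simply filtering them out or by treating them as zero when they occur (depending on the situation), without making any changes to the underlying database itself. (In other words, I'm not asking how to convert all NULL values to zeros from within SQL.)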

Basically, I'm trying to use dplyr to work with a SQL database, but I keep getting unexpected results.

# Using Lahman's Database, available here:
# https://www.kaggle.com/seanlahman/the-history-of-baseball
library(dplyr)

db.path <- '~/data/SQLite Databases/the-history-of-baseball/database.sqlite'
con <- DBI::dbConnect(RSQLite::SQLite(), db.path)
batting_db <- tbl(con, 'batting')

# the result of this code is at least (seemingly) correct--the columns appear
# to be the correct type and the entries shown are all accurate:
batting_db %>% filter(hr >= 50)

# however, when the additional constraint is added, columns get coerced to
# characters and rows where hr == '' start showing up
batting_db %>% filter(hr >= 50, year >= 1985)

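First of all, why is this even a problem? Since '' >= 50 evaluates to FALSE, why aren't the empty strings filtered out? (Note: adding hr != '' as an additional constraint seems to fix this, although I still don't understand why.)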
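A likely explanation for the comparison puzzle: filter() on a database table is translated to SQL and evaluated by SQLite, not by R. As far as I understand SQLite's documented comparison rules, a TEXT value always sorts above any INTEGER or REAL value, so '' >= 50 is true on the database side even though it is FALSE in R. Here is a minimal sketch on a hypothetical in-memory table (the table name toy and its contents are made up for illustration, not taken from the Lahman data):

```r
library(DBI)

# Hypothetical toy table standing in for `batting`; not the real data
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE toy (hr INTEGER)")
dbExecute(con, "INSERT INTO toy VALUES (60), (10), ('')")

# typeof() reports the storage class SQLite actually used for each value;
# the empty string cannot be converted to a number, so it stays TEXT
storage <- dbGetQuery(
  con,
  "SELECT typeof(hr) AS storage, COUNT(*) AS n FROM toy GROUP BY typeof(hr)"
)

# Number of rows that pass hr >= 50 when SQLite does the comparison
n_pass_sql <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM toy WHERE hr >= 50")$n

# The same comparison in R: 50 is coerced to "50" and compared as a string
r_cmp <- "" >= 50

dbDisconnect(con)
```

If this reading is right, n_pass_sql counts both the 60 row and the empty-string row, which would also explain why adding hr != '' makes the unwanted rows disappear.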

Also, as for converting these empty strings to zeros, I'm not even sure that's necessary, since dplyr apparently treats them as zeros in calculations (?!).

# mutate appears to treat these empty strings as '0' in calculations
batting_db %>%
  filter(hr >= 30, year >= 1985) %>%
  select(player_id:g, h, hr) %>%
  mutate(hr2 = hr + 5, hr3 = g * hr)
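The zero-like behaviour in mutate() probably comes from SQLite as well, not from dplyr: when a TEXT value that does not look like a number is used in arithmetic, SQLite converts it to 0. The sketch below only uses literal values on an in-memory connection, so it assumes nothing about the real table. It also shows that wrapping the value in NULLIF(x, '') turns the empty string into a proper NULL, which then propagates through the arithmetic.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

res <- dbGetQuery(con, "
  SELECT '' + 5             AS plus_five,    -- empty string used as a number
         3 * ''             AS times_three,
         NULLIF('', '') + 5 AS with_nullif   -- NULL instead of empty string
")

dbDisconnect(con)
res
```

Under that reading, plus_five and times_three are computed as if the empty string were 0, while with_nullif comes back as NA in R.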

Basically, I just don't understand how dplyr behaves when it's used to access a database, and I'd appreciate any help.

Answer 1

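This solves the problem of the poorly constructed SQL table in the example above by converting all columns to type character, replacing empty strings with NA, and converting back to integer where appropriate. If you are trying to compute statistics on these stats, then you certainly don't want to treat missing values as zero, but you know that already.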

library(dplyr)
library(DBI)

db.path <- "database.sqlite"
con <- DBI::dbConnect(RSQLite::SQLite(), db.path)
batting_db <- tbl(con, 'batting')

batting <- batting_db %>% 
  mutate_all(as.character) %>% 
  as_tibble %>% # must be a data frame for na_if to work
  na_if("") %>%  #replace empty strings with NA
  #convert numerics back to numerics
  mutate_at(vars(-one_of(c("player_id","team_id","league_id"))),as.integer)

# add a new table to the database with our clean data
dbWriteTable(con,"batting_mod",batting,overwrite=TRUE)
# back to where we started from but with a clean table
batting_db <- tbl(con, 'batting_mod')
batting_db

# Source:   table<batting_mod> [?? x 22]
# Database: sqlite 3.22.0 [C:\Users\nsteinm\Documents\R\temp\database.sqlite]
   player_id  year stint team_id league_id     g    ab     r     h double triple    hr   rbi    sb
   <chr>     <int> <int> <chr>   <chr>     <int> <int> <int> <int>  <int>  <int> <int> <int> <int>
 1 abercda01  1871     1 TRO     NA            1     4     0     0      0      0     0     0     0
 2 addybo01   1871     1 RC1     NA           25   118    30    32      6      0     0    13     8
 3 allisar01  1871     1 CL1     NA           29   137    28    40      4      5     0    19     3
 4 allisdo01  1871     1 WS3     NA           27   133    28    44     10      2     2    27     1
 5 ansonca01  1871     1 RC1     NA           25   120    29    39     11      3     0    16     6
 6 armstbo01  1871     1 FW1     NA           12    49     9    11      2      1     0     5     0
 7 barkeal01  1871     1 RC1     NA            1     4     0     1      0      0     0     2     0
 8 barnero01  1871     1 BS1     NA           31   157    66    63     10      9     0    34    11
 9 barrebi01  1871     1 FW1     NA            1     5     1     1      1      0     0     1     0
10 barrofr01  1871     1 BS1     NA           18    86    13    13      2      1     0    11     1
# ... with more rows, and 8 more variables: cs <int>, bb <int>, so <int>, ibb <int>, hbp <int>,
#   sh <int>, sf <int>, g_idp <int>
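If writing a new table is not an option, a similar cleanup can probably be done lazily, so the database itself is left untouched. This is only a sketch under assumptions: it uses a hypothetical in-memory table named toy rather than the real batting table, it needs the dbplyr package installed, and it relies on dbplyr translating na_if() to SQL NULLIF and as.integer() to a CAST.

```r
library(dplyr)

# Hypothetical stand-in for the badly typed table: hr stored as text,
# with an empty string where a value is missing
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  con, "toy",
  data.frame(player_id = c("a", "b", "c"),
             hr = c("60", "", "10"),
             stringsAsFactors = FALSE)
)
toy_db <- tbl(con, "toy")

# Nothing is written back: the cleanup is part of the generated query
clean_db <- toy_db %>%
  mutate(hr = na_if(hr, "")) %>%   # empty string -> NULL
  mutate(hr = as.integer(hr))      # text -> integer

result <- clean_db %>%
  filter(hr >= 50) %>%
  collect()

DBI::dbDisconnect(con)
result
```

Calling show_query() on clean_db before collecting is a quick way to check which SQL is actually generated for a given backend.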

Answer 2

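I suspect, but don't know for sure, that dplyr converts any NULLs in a SQL database to NA. I can't create a data frame with NULLs to test this, since that isn't a valid R construct. I would need to see an example. If we assume that NULL gets changed to NA, this example sets NA to zero and works on a copy of the table, without modifying the database.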
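A minimal sketch of that idea, which is not the answer's original code: pull a copy of the table into R with collect(), then replace NA with zero in the local copy only. The table toy and its contents are hypothetical; dbWriteTable() is used here just to get a genuine SQL NULL into the test database.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# The NA in the data frame is stored as a SQL NULL
DBI::dbWriteTable(
  con, "toy",
  data.frame(player_id = c("a", "b"),
             hr = c(60L, NA),
             stringsAsFactors = FALSE)
)

# Local copy: the SQL NULL arrives in R as NA
local_copy <- tbl(con, "toy") %>% collect()

# Replace NA with zero in the copy only
zeroed <- local_copy %>% mutate(hr = coalesce(hr, 0L))

# The database still holds the NULL
n_null_in_db <- DBI::dbGetQuery(
  con, "SELECT COUNT(*) AS n FROM toy WHERE hr IS NULL"
)$n

DBI::dbDisconnect(con)
```

Whether zero is the right replacement depends on the calculation: sums are unaffected, but means and counts will change.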


dplyr & gracefully handling SQL NULL values
