Writing Unicode from R to SQL Server

Time: 2018-01-04 23:54:38

Tags: sql-server r unicode

I'm trying to write Unicode strings from R to SQL Server and then use that SQL table to power a Power BI dashboard. Unfortunately, the Unicode characters only seem to survive when I load the table back into R, not when I view the table in SSMS or Power BI.

require(odbc)
require(DBI)
require(dplyr)
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=R9-0KY02L01\\SQLEXPRESS;Database=Test;trusted_connection=yes;")
testData <- data_frame(Characters = "❤")
dbWriteTable(con,"TestUnicode",testData,overwrite=TRUE)
result <- dbReadTable(con, "TestUnicode")
result$Characters

This successfully returns:

> result$Characters
[1] "❤"

However, when I pull up that table in SSMS:

SELECT * FROM TestUnicode

I get two different characters:

Characters
~~~~~~~~~~
â¤

These are also the characters that show up in Power BI. How do I get the heart character out of R correctly?

4 Answers:

Answer 0 (score: 2)

This turns out to be a bug somewhere in R/DBI/the ODBC driver. The problem is that R stores strings as UTF-8 encoded, while SQL Server stores them as UTF-16LE encoded. Also, when dbWriteTable creates a table, it defaults to creating VARCHAR columns for strings, which can't even hold Unicode characters. So you need to do both of the following:

  1. Change the columns in your R data frame from string columns to list columns of UTF-16LE raw bytes.
  2. When using dbWriteTable, specify the field type as NVARCHAR(MAX).

  It still seems like this should really be handled by DBI or odbc or something, though.

    require(odbc)
    require(DBI)
    
    # This function takes a string vector and turns it into a list of raw UTF-16LE bytes. 
    # These will be needed to load into SQL Server
    convertToUTF16 <- function(s){
      lapply(s, function(x) unlist(iconv(x,from="UTF-8",to="UTF-16LE",toRaw=TRUE)))
    }
    
    # create a connection to a sql table
    connectionString <- "[YOUR CONNECTION STRING]"
    con <- DBI::dbConnect(odbc::odbc(),
                          .connection_string = connectionString)
    
    # our example data
    testData <- data.frame(ID = c(1,2,3), Char = c("I", "❤","Apples"), stringsAsFactors=FALSE)
    
    # we adjust the column with the UTF-8 strings to instead be a list column of UTF-16LE bytes
    testData$Char <- convertToUTF16(testData$Char)
    
    # write the table to the database, specifying the field type
    dbWriteTable(con, 
                 "UnicodeExample", 
                 testData, 
                 append=TRUE, 
                 field.types = c(Char = "NVARCHAR(MAX)"))
    
    dbDisconnect(con)
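
As a quick sanity check, the helper should turn the heart character, U+2764, into its two raw UTF-16LE bytes (0x64 0x27):

    convertToUTF16("❤")
    #> [[1]]
    #> [1] 64 27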
    

Answer 1 (score: 0)

Inspired by the last answer and by github: r-dbi/DBI#215: Storing unicode characters in SQL Server.

I started from field.types = c(Char = "NVARCHAR(MAX)"), but because of the error dbReadTable/dbGetQuery returns Invalid Descriptor Index .... I instead build a vector of field types with a computed max length:


vector_nvarchar <- c(Filter(Negate(is.null),
  lapply(testData, function(x) {
    if (is.character(x)) c(
      names(x),
      paste0("NVARCHAR(",
             max(
               # nvarchar(max) gave the "dbReadTable/dbGetQuery returns
               # Invalid Descriptor Index" error on SQL Server
               # (https://github.com/r-dbi/odbc/issues/112),
               # so we compute the max length instead
               nchar(
                 iconv( # nchar doesn't give byte lengths for UTF-8: see help(nchar)
                   Filter(Negate(is.null), x),
                   "UTF-8", "ASCII", sub = "x"
                 )
               ),
               na.rm = TRUE),
             ")"
      )
    )
  })
))
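
For reference, assuming testData here is still the example data frame from the previous answer with Char as a plain character column (an assumption on my part, since it isn't restated above), the computed field types come out roughly like this:

# "Apples" is the longest value (6 characters) after the byte-wise ASCII
# substitution, so the character column maps to NVARCHAR(6)
stopifnot(identical(vector_nvarchar, c(Char = "NVARCHAR(6)")))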

con <- DBI::dbConnect(odbc::odbc(), .connection_string = xxxxt, encoding = 'UTF-8')

DBI::dbWriteTable(con, "UnicodeExample", testData, overwrite = TRUE, append = FALSE, field.types = vector_nvarchar)

DBI::dbGetQuery(con, iconv('select * from UnicodeExample'))

Answer 2 (score: 0)

Inspired by the last answer, I also tried to find a way to write data frames to SQL Server automatically. I could not confirm the nvarchar(max) error, so I ended up with the following functions:

library(rlist)   # provides list.cbind(), used below

# Convert every character column of a data frame into a list column of
# UTF-16LE raw bytes, keeping the original column order.
convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"]
    , list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
      return(lapply(x, function(y) unlist(iconv(y, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))))
    }))
  )[colnames(df)]

  return(output)
}

field_types <- function(df){

  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"

  return(output)
}

DBI::dbWriteTable(odbc_connect
                  , name = SQL("database.schema.table")
                  , value = convertToUTF16_df(df)
                  , overwrite = TRUE
                  , row.names = FALSE
                  , field.types = field_types(df)
)
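
As a rough usage sketch of the two functions above (the data frame and its column names here are made up for illustration):

df <- data.frame(ID = 1:3, Char = c("I", "❤", "Apples"), stringsAsFactors = FALSE)

# Only the character column gets an nvarchar(max) field type
field_types(df)
#> $Char
#> [1] "nvarchar(max)"

# Character columns become list columns of UTF-16LE raw bytes;
# other columns and the original column order are preserved
converted <- convertToUTF16_df(df)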

Answer 3 (score: 0)

I found the previous answers very useful, but ran into problems with character vectors that used another encoding (e.g. "latin1" instead of UTF-8). This led to random NULLs in the database column, caused by special characters such as non-breaking spaces.

To avoid these encoding issues, I made the following modification to detect the character vector's encoding, or fall back to UTF-8 by default, before converting to UTF-16LE:

library(rlist)

convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"]
                  , list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) {
                        if (Encoding(y)=="unknown") {
                          unlist(iconv(enc2utf8(y), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
                        } else {
                          unlist(iconv(y, from = Encoding(y), to = "UTF-16LE", toRaw = TRUE))
                        }
                      }))
                  }))
  )[colnames(df)]

  return(output)
}

field_types <- function(df){

  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"

  return(output)
}

DBI::dbWriteTable(odbc_connect
                  , name = SQL("database.schema.table")
                  , value = convertToUTF16_df(df)
                  , overwrite = TRUE
                  , row.names = FALSE
                  , field.types = field_types(df)
)

Ideally, I would still modify this to remove the rlist dependency, but it seems to work for now.
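
As an aside, here is a minimal sketch of just the encoding-detection step above (the latin1 string is a made-up example):

# A string re-encoded to latin1 and marked as such by iconv()
x_latin1 <- iconv("caf\u00e9", from = "UTF-8", to = "latin1")
Encoding(x_latin1)
#> [1] "latin1"

# Converting from the declared encoding (instead of assuming UTF-8)
# yields the correct UTF-16LE bytes for "c" "a" "f" "é"
unlist(iconv(x_latin1, from = Encoding(x_latin1), to = "UTF-16LE", toRaw = TRUE))
#> [1] 63 00 61 00 66 00 e9 00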