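I am trying to write Unicode strings from R to SQL Server and then use that SQL table to power a Power BI dashboard. Unfortunately, the Unicode characters only seem to work when I load the table back into R, not when I view the table in SSMS or Power BI.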
require(odbc)
require(DBI)
require(dplyr)
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = "DRIVER={ODBC Driver 13 for SQL Server};SERVER=R9-0KY02L01\\SQLEXPRESS;Database=Test;trusted_connection=yes;")
testData <- tibble(Characters = "❤")  # data_frame() is deprecated in favor of tibble()
dbWriteTable(con,"TestUnicode",testData,overwrite=TRUE)
result <- dbReadTable(con, "TestUnicode")
result$Characters
This successfully yields:
> result$Characters
[1] "❤"
However, when I pull up the table in SSMS:
SELECT * FROM TestUnicode
I get two different characters:
Characters
~~~~~~~~~~
â¤
These are also the characters that show up in Power BI. How do I correctly get the heart character out of R?
Answer 0 (score: 2)
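It turns out this is a bug somewhere in R / DBI / the ODBC driver. The problem is that R stores strings as UTF-8, whereas SQL Server stores them as UTF-16LE. In addition, when dbWriteTable creates the table, it defaults to VARCHAR columns for strings, which cannot even hold Unicode characters. So you need to do two things: change the string column in the R data frame into a list column of raw UTF-16LE bytes, and specify NVARCHAR(MAX) as the field type when calling dbWriteTable.
It still seems like this should be handled by DBI or odbc or something.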
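To make the encoding mismatch described above concrete, here is a minimal sketch that needs no database. It shows the bytes R holds for the heart character and the bytes an NVARCHAR column expects. The last step is an assumption: it treats the UTF-8 bytes as latin1 to imitate what a single-byte VARCHAR column might do, which is one plausible explanation for the garbled output in SSMS, not a confirmed account of the driver's behavior.

```r
# U+2764 (heavy black heart), written as an escape so the script itself stays ASCII
heart <- "\u2764"

# R holds the string as UTF-8: three bytes
utf8_bytes <- charToRaw(enc2utf8(heart))
print(utf8_bytes)    # e2 9d a4

# An NVARCHAR column expects UTF-16LE: a single two-byte code unit for this character
utf16_bytes <- iconv(heart, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(utf16_bytes)   # 64 27

# Assumption: the UTF-8 bytes are read back as a single-byte code page.
# latin1 is a stand-in for whatever code page the VARCHAR column actually uses.
mojibake <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")
print(nchar(mojibake, type = "chars"))   # 3 characters instead of 1
```

Under that assumption the one heart becomes three characters, the middle one being an invisible control character, which would match the two visible characters reported in the question. The answer's own code, which applies the UTF-16LE conversion before writing to SQL Server, follows.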
require(odbc)
require(DBI)
# This function takes a string vector and turns it into a list of raw UTF-16LE bytes.
# These will be needed to load into SQL Server
convertToUTF16 <- function(s){
  lapply(s, function(x) unlist(iconv(x, from="UTF-8", to="UTF-16LE", toRaw=TRUE)))
}
# create a connection to a sql table
connectionString <- "[YOUR CONNECTION STRING]"
con <- DBI::dbConnect(odbc::odbc(),
                      .connection_string = connectionString)
# our example data
testData <- data.frame(ID = c(1,2,3), Char = c("I", "❤","Apples"), stringsAsFactors=FALSE)
# we adjust the column with the UTF-8 strings to instead be a list column of UTF-16LE bytes
testData$Char <- convertToUTF16(testData$Char)
# write the table to the database, specifying the field type
dbWriteTable(con,
             "UnicodeExample",
             testData,
             append=TRUE,
             field.types = c(Char = "NVARCHAR(MAX)"))
dbDisconnect(con)
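Before writing anything to the server, the conversion can be checked locally. The sketch below repeats convertToUTF16 from the answer so that it runs on its own, then decodes the raw bytes back to UTF-8 and compares them with the input. The names original, encoded and decoded are illustrative only.

```r
# Same helper as in the answer: one raw UTF-16LE vector per input string
convertToUTF16 <- function(s) {
  lapply(s, function(x) unlist(iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)))
}

original <- c("I", "\u2764", "Apples")
encoded <- convertToUTF16(original)

# Every character here occupies two bytes in UTF-16LE
print(lengths(encoded))   # 2 2 12

# iconv() also accepts a list of raw vectors, so the round trip is a single call
decoded <- iconv(encoded, from = "UTF-16LE", to = "UTF-8")
stopifnot(all(decoded == original))
```

If the round trip fails for some value, the input was probably not valid UTF-8 to begin with, which is worth ruling out before suspecting the driver.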
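Answer 1 (score: 0)
Inspired by the last answer and github: r-dbi/DBI#215: Storing unicode characters in SQL Server
This follows the field.types = c(Char = "NVARCHAR(MAX)") approach,
but because of the error dbReadTable/dbGetQuery returns Invalid Descriptor Index .... it builds a vector of field types and computes the maximum length instead: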
# nvarchar(max) gave the error "dbReadTable/dbGetQuery returns Invalid Descriptor Index" on SQL Server
# (https://github.com/r-dbi/odbc/issues/112), so compute an explicit length per character column instead
vector_nvarchar <- unlist(Filter(Negate(is.null),
  lapply(testData, function(x) {
    if (is.character(x)) {
      paste0("NVARCHAR(",
             max(
               # iconv() replaces every non-ASCII byte with "x", so nchar() returns the UTF-8 byte count
               nchar(iconv(x, "UTF-8", "ASCII", sub = "x")),
               na.rm = TRUE),
             ")")
    }
  })
))
con= DBI::dbConnect(odbc::odbc(),.connection_string=xxxxt, encoding = 'UTF-8')
DBI::dbWriteTable(con,"UnicodeExample",testData, overwrite= TRUE, append=FALSE, field.types= vector_nvarchar)
DBI::dbGetQuery(con,iconv('select * from UnicodeExample'))
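The length in NVARCHAR(n) is counted in two-byte UTF-16 units, and n may range from 1 to 4000. As an alternative to the ASCII-substitution trick above, the sketch below counts those units directly. The helpers utf16_units and nvarchar_types are hypothetical names introduced here for illustration; they are not part of DBI or odbc.

```r
# Number of UTF-16 code units (byte pairs) for each string; NA for missing values
utf16_units <- function(x) {
  raw_list <- iconv(enc2utf8(x), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)
  vapply(raw_list,
         function(b) if (is.null(b)) NA_integer_ else length(b) %/% 2L,
         integer(1))
}

# Named character vector of field types, one entry per character column
nvarchar_types <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  vapply(df[is_chr], function(col) {
    n <- max(utf16_units(col), 1L, na.rm = TRUE)   # at least 1, even if every value is NA
    if (n > 4000L) "NVARCHAR(MAX)" else sprintf("NVARCHAR(%d)", n)
  }, character(1))
}

testData <- data.frame(ID = c(1, 2, 3),
                       Char = c("I", "\u2764", "Apples"),
                       stringsAsFactors = FALSE)
print(nvarchar_types(testData))   # Char = "NVARCHAR(6)"
```

The result can be passed as field.types to dbWriteTable. Columns longer than 4000 units fall back to NVARCHAR(MAX), so the Invalid Descriptor Index issue mentioned above could still appear for those.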
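Answer 2 (score: 0)
Inspired by the last answer, I was also trying to find an automated way to write data frames to SQL Server. I could not reproduce the nvarchar(max) error, so I ended up with the following functions: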
library(rlist)  # provides list.cbind()
# convert every character column into a list column of raw UTF-16LE bytes
convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"]
                  , list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) unlist(iconv(y, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))))
                  }))
  )[colnames(df)]
  return(output)
}
# declare every character column as nvarchar(max)
field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}
# odbc_connect is an existing DBI connection (not defined in this snippet)
DBI::dbWriteTable(odbc_connect
                  , name = DBI::SQL("database.schema.table")
                  , value = convertToUTF16_df(df)
                  , overwrite = TRUE
                  , row.names = FALSE
                  , field.types = field_types(df)
)
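Answer 3 (score: 0)
I found the previous answer very useful, but ran into problems with character vectors that use another encoding (for example "latin1" instead of UTF-8). Special characters such as non-breaking spaces then caused seemingly random NULLs in the database column.
To avoid these encoding issues, I made the following modification, which detects the encoding of each string, or falls back to UTF-8 when it is unknown, before converting to UTF-16LE: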
library(rlist)
convertToUTF16_df <- function(df){
  output <- cbind(df[sapply(df, typeof) != "character"]
                  , list.cbind(apply(df[sapply(df, typeof) == "character"], 2, function(x){
                    return(lapply(x, function(y) {
                      if (Encoding(y)=="unknown") {
                        unlist(iconv(enc2utf8(y), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE))
                      } else {
                        unlist(iconv(y, from = Encoding(y), to = "UTF-16LE", toRaw = TRUE))
                      }
                    }))
                  }))
  )[colnames(df)]
  return(output)
}
field_types <- function(df){
  output <- list()
  output[colnames(df)[sapply(df, typeof) == "character"]] <- "nvarchar(max)"
  return(output)
}
# odbc_connect is an existing DBI connection (not defined in this snippet)
DBI::dbWriteTable(odbc_connect
                  , name = DBI::SQL("database.schema.table")
                  , value = convertToUTF16_df(df)
                  , overwrite = TRUE
                  , row.names = FALSE
                  , field.types = field_types(df)
)
Ideally, I would still modify this to remove the rlist dependency, but it seems to work for now.
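One possible way to drop the rlist dependency, offered as a sketch rather than a tested replacement: enc2utf8() already honors a declared encoding such as latin1, so a single vectorized iconv() call per column may be enough. The function name convert_to_utf16_df is hypothetical and chosen so it does not clash with the version above. It has only been reasoned through on small data frames, not run against a live SQL Server.

```r
# Base-R variant: replace each character column by a list column of raw UTF-16LE bytes
convert_to_utf16_df <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  for (col in names(df)[is_chr]) {
    # enc2utf8() converts latin1-marked and native strings to UTF-8;
    # iconv(toRaw = TRUE) then returns one raw vector per string (NULL for NA)
    df[[col]] <- iconv(enc2utf8(df[[col]]), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)
  }
  df
}

# Example with a latin1-marked string containing e-acute (byte 0xE9)
latin1_word <- "caf\xe9"
Encoding(latin1_word) <- "latin1"

example_df <- data.frame(ID = 1:2,
                         Char = c("\u2764", latin1_word),
                         stringsAsFactors = FALSE)
converted <- convert_to_utf16_df(example_df)
print(converted$Char[[2]])   # 63 00 61 00 66 00 e9 00
```

The column order is preserved because columns are replaced in place, so the [colnames(df)] reordering step is no longer needed. The field_types helper above can be used unchanged alongside it.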