使用df.write.jdbc()将数据帧写入SQL Server会产生错误:列具有无法参与列存储索引的数据类型

时间:2018-02-16 19:53:47

标签: sql-server apache-spark pyspark

我在20个节点的集群中使用带有spark 2.2和python 2.7的pyspark。我正在使用df = spark.read.jdbc(...)从云blob存储中将数据加载到数据框中,然后尝试使用df.write.jdbc(...)将其写入我的SQL Server数据库。但是,在写入过程中,我收到以下错误:

py4j.protocol.Py4JJavaError: An error occurred while calling o67.jdbc. : com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'my_col' has a data type that cannot participate in a columnstore index.

df的架构如下:

root |-- my_col: string (nullable = true) |-- my_other_col: string (nullable = true) ...

This post让我相信df.write.jdbc(...)可能会在写入时尝试为df中的所有列创建列存储索引。不幸的是,我不知道如何阻止这样做,所以我可以缓解这个问题。

我的代码的摘要版本如下所示:

spark = (pyspark.sql.SparkSession.builder
             .appName('my-app').getOrCreate())

df = (self.spark.read.format('com.databricks.spark.avro')
                      .options(inferSchema=True, header=True)
                      .load(blob_storage_path)).repartition(self.num_partitions)

df.write.jdbc(url=self.jdbc_url, table=table, mode='overwrite',
                          properties=self.jdbc_properties)

这是完整的堆栈跟踪:

py4j.protocol.Py4JJavaError: An error occurred while calling o68.jdbc. 
:  com.microsoft.sqlserver.jdbc.SQLServerException: The statement failed. Column 'my_col' has a data type that cannot participate in a columnstore index.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:217)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1655)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.doExecuteStatement(SQLServerStatement.java:885)
at com.microsoft.sqlserver.jdbc.SQLServerStatement$StmtExecCmd.doExecute(SQLServerStatement.java:778)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7505)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:2445)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:191)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:166)
at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeUpdate(SQLServerStatement.java:703)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:805)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:461)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

提前感谢您提供的任何帮助!

1 个答案:

答案 0 :(得分:0)

你有什么SQL Server版本?对于2017年之前的版本中的列存储索引中的数据类型有一些限制。本文的读取限制部分https://docs.microsoft.com/en-us/sql/t-sql/statements/create-columnstore-index-transact-sql您可以做的事情是删除列存储索引或迁移到Sql Server 2017。