Writing a dataset with string columns to Teradata

Asked: 2018-01-02 10:33:51

Tags: apache-spark spark-dataframe teradata

When I try to write a dataset from Spark to Teradata and the dataset contains string columns, I get the following error:

2018-01-02 15:49:05 [pool-2-thread-2] ERROR c.i.i.t.spark2.algo.JDBCTableWriter:115 - Error in JDBC operation:
java.sql.SQLException: [Teradata Database] [TeraJDBC 15.00.00.20] [Error 3706] [SQLState 42000] Syntax error: Data Type "TEXT" does not match a Defined Type name.
      at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:308)
    at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:109)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:307)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:196)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:123)
    at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:114)
    at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:385)
    at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecuteUpdate(TDStatement.java:602)
    at com.teradata.jdbc.jdbc_4.TDStatement.executeUpdate(TDStatement.java:1109)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:805)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:90)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:472)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)

How can I make sure the data is written to Teradata correctly?

I am reading a CSV file from HDFS into a dataset and then trying to write it to Teradata with DataFrameWriter, using the code below:

ds.write().mode("append")
            .jdbc(url, tableName, props);

I am using Spark 2.2.0, and Teradata is 15.00.00.07. I ran into a similar problem when writing to Netezza; with DB2 the write succeeds, but the string values come out replaced. Are there any options I need to set when writing to these databases?

2 Answers:

Answer 0 (score: 0)

I was able to resolve this by implementing a custom JdbcDialect for Teradata. The same approach can be used to fix similar problems with other data sources such as Netezza, DB2, and Hive.

To do this, extend the 'JdbcDialect' class and register it:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcType;
import org.apache.spark.sql.types.DataType;

import scala.Option;

public class TDDialect extends JdbcDialect {

    private static final long serialVersionUID = 1L;

    // Maps Catalyst simple type names to Teradata column types. Without this,
    // Spark's generic mapping emits TEXT for strings, which Teradata rejects
    // with "Error 3706 ... Data Type TEXT does not match a Defined Type name".
    private static final Map<String, Option<JdbcType>> dataTypeMap =
            new HashMap<String, Option<JdbcType>>();

    static {
        dataTypeMap.put("int",
                Option.apply(JdbcType.apply("INTEGER", java.sql.Types.INTEGER)));
        dataTypeMap.put("long",
                Option.apply(JdbcType.apply("BIGINT", java.sql.Types.BIGINT)));
        dataTypeMap.put("double",
                Option.apply(JdbcType.apply("DOUBLE PRECISION", java.sql.Types.DOUBLE)));
        dataTypeMap.put("float",
                Option.apply(JdbcType.apply("FLOAT", java.sql.Types.FLOAT)));
        dataTypeMap.put("short",
                Option.apply(JdbcType.apply("SMALLINT", java.sql.Types.SMALLINT)));
        dataTypeMap.put("byte",
                Option.apply(JdbcType.apply("BYTEINT", java.sql.Types.TINYINT)));
        dataTypeMap.put("binary",
                Option.apply(JdbcType.apply("BLOB", java.sql.Types.BLOB)));
        dataTypeMap.put("timestamp",
                Option.apply(JdbcType.apply("TIMESTAMP", java.sql.Types.TIMESTAMP)));
        dataTypeMap.put("date",
                Option.apply(JdbcType.apply("DATE", java.sql.Types.DATE)));
        dataTypeMap.put("string",
                Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
        dataTypeMap.put("boolean",
                Option.apply(JdbcType.apply("CHAR(1)", java.sql.Types.CHAR)));
        dataTypeMap.put("text",
                Option.apply(JdbcType.apply("VARCHAR(255)", java.sql.Types.VARCHAR)));
    }

    @Override
    public boolean canHandle(String url) {
        return url.startsWith("jdbc:teradata");
    }

    @Override
    public Option<JdbcType> getJDBCType(DataType dt) {
        Option<JdbcType> option = dataTypeMap.get(dt.simpleString().toLowerCase());
        if (option == null) {
            // Fall back to Spark's default mapping for any type not listed above.
            option = Option.empty();
        }
        return option;
    }
}
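Stripped of the Spark and Scala types, the heart of the dialect is just a name-to-DDL lookup that returns "empty" on a miss so Spark can fall back to its defaults. A minimal plain-Java sketch of that logic (the `TypeMapSketch` class and `mapTypeName` helper are illustrative only, not part of the Spark API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class TypeMapSketch {
    // Same idea as dataTypeMap above: Catalyst simple name -> Teradata DDL type.
    private static final Map<String, String> DDL_BY_NAME = new HashMap<>();
    static {
        DDL_BY_NAME.put("string", "VARCHAR(255)");
        DDL_BY_NAME.put("long", "BIGINT");
        DDL_BY_NAME.put("boolean", "CHAR(1)");
    }

    // Mirrors getJDBCType: a hit returns the DDL type name; a miss returns
    // empty, which stands in for Option.empty() / Spark's generic mapping.
    static Optional<String> mapTypeName(String catalystName) {
        return Optional.ofNullable(DDL_BY_NAME.get(catalystName.toLowerCase()));
    }

    public static void main(String[] args) {
        System.out.println(mapTypeName("string").orElse("<fallback>"));  // VARCHAR(255)
        System.out.println(mapTypeName("decimal(10,2)").orElse("<fallback>"));  // <fallback>
    }
}
```

Because the lookup key is `dt.simpleString().toLowerCase()`, the mapping is case-insensitive in the type name, and anything it does not list is handed back to Spark unchanged.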

Now register the dialect with the snippet below before invoking any Spark action:

JdbcDialects.registerDialect(new TDDialect());

For some data sources, such as Hive, you may also need to override one more method to avoid NumberFormatExceptions or similar errors:

@Override
public String quoteIdentifier(String colName) {
    // Return the column name unquoted instead of Spark's default quoting.
    return colName;
}
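For context, Spark's built-in JdbcDialect default wraps identifiers in double quotes, which some sources misparse; the override above simply passes the name through. A sketch of the difference (the `QuoteSketch` class and helper names are illustrative):

```java
public class QuoteSketch {
    // Spark's default JdbcDialect behavior: wrap the name in double quotes.
    static String defaultQuote(String colName) {
        return "\"" + colName + "\"";
    }

    // The override shown above: emit the name as-is, unquoted.
    static String passThrough(String colName) {
        return colName;
    }

    public static void main(String[] args) {
        System.out.println(defaultQuote("price")); // "price"
        System.out.println(passThrough("price"));  // price
    }
}
```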

Hope this helps anyone facing a similar problem.

Answer 1 (score: 0)

This worked for me. Could you try it once and let me know?

Points to be noted:
***Your Hive table must use Text format as its storage; it should not be ORC.
Create the schema in Teradata before writing it from your PySpark notebook.***


df = spark.sql("select * from dbname.tableName")

properties = {
    "driver": "com.teradata.jdbc.TeraDriver",
    "user": "xxxx",
    "password": "xxxxx"
}

df.write.jdbc(url='provide_url', table='dbName.tableName', properties=properties)
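The `url` placeholder above follows the Teradata JDBC driver's URL form, `jdbc:teradata://<host>/DATABASE=<db>`. A small sketch of assembling such a URL (the `TeradataUrlSketch` class is illustrative, and host/database values are placeholders):

```java
public class TeradataUrlSketch {
    // Builds a connection URL of the form jdbc:teradata://<host>/DATABASE=<db>.
    static String buildUrl(String host, String database) {
        return "jdbc:teradata://" + host + "/DATABASE=" + database;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("td-host.example.com", "mydb"));
        // jdbc:teradata://td-host.example.com/DATABASE=mydb
    }
}
```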