How to add headers to a CSV table in Scala Spark

Time: 2019-07-30 23:27:10

Tags: scala azure csv apache-spark apache-spark-sql

I am trying to read data from a table stored in a CSV file. The file has no header row, so when I query the table with Spark SQL, every result is null.

I tried creating a schema struct, and it shows up when I call printSchema(), but when I run select * from tableName it does not work: all values are null. I also tried StructType().add( colName ) instead of StructField, with the same result.

        val schemaStruct1 = StructType(
            StructField( "AgreementVersionID", IntegerType, true )::
            StructField( "ProgramID", IntegerType, true )::
            StructField( "AgreementID", IntegerType, true )::
            StructField( "AgreementVersionNumber", IntegerType, true )::
            StructField( "AgreementStatusID", IntegerType, true )::
            StructField( "AgreementEffectiveDate", DateType, true )::
            StructField( "AgreementEffectiveDateDay", IntegerType, true )::
            StructField( "AgreementEndDate", DateType, true )::
            StructField( "AgreementEndDateDay", IntegerType, true )::
            StructField( "MasterAgreementNumber", IntegerType, true )::
            StructField( "MasterAgreementEffectiveDate", DateType, true )::
            StructField( "MasterAgreementEffectiveDateDay", IntegerType, true )::
            StructField( "MasterAgreementEndDate", DateType, true )::
            StructField( "MasterAgreementEndDateDay", IntegerType, true )::
            StructField( "SalesContactName", StringType, true )::
            StructField( "RevenueSubID", IntegerType, true )::
            StructField( "LicenseAgreementContractTypeID", IntegerType, true )::Nil
        )

        val df1 = session.read
            .option( "header", true )
            .option( "delimiter", "," )
            .schema( schemaStruct1 )
            .csv( LicenseAgrmtMaster )
        df1.printSchema()
        df1.createOrReplaceTempView( "LicenseAgrmtMaster" )

Printing the schema produces the following output:

root
 |-- AgreementVersionID: integer (nullable = true)
 |-- ProgramID: integer (nullable = true)
 |-- AgreementID: integer (nullable = true)
 |-- AgreementVersionNumber: integer (nullable = true)
 |-- AgreementStatusID: integer (nullable = true)
 |-- AgreementEffectiveDate: date (nullable = true)
 |-- AgreementEffectiveDateDay: integer (nullable = true)
 |-- AgreementEndDate: date (nullable = true)
 |-- AgreementEndDateDay: integer (nullable = true)
 |-- MasterAgreementNumber: integer (nullable = true)
 |-- MasterAgreementEffectiveDate: date (nullable = true)
 |-- MasterAgreementEffectiveDateDay: integer (nullable = true)
 |-- MasterAgreementEndDate: date (nullable = true)
 |-- MasterAgreementEndDateDay: integer (nullable = true)
 |-- SalesContactName: string (nullable = true)
 |-- RevenueSubID: integer (nullable = true)
 |-- LicenseAgreementContractTypeID: integer (nullable = true)

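This is correct, but querying the table returns only null values, even though the file itself contains no nulls. I need to be able to read this table so that I can join it with another table to complete a stored procedure.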

1 answer:

Answer 0: (score: 1)

I suggest following the steps below; you can then adapt the code to your needs.

// Read the headerless file without a schema; Spark names the columns _c0, _c1, ...
val df = session.read.option( "delimiter", "," ).csv("<Path of your file/dir>")
// Example names only: list exactly one name per column in the file
val columnNames = Seq("name","id")
val dfWithHeader = df.toDF(columnNames:_*)
// dfWithHeader now has named columns; every column is read as a string, so check the types and cast where needed
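
The steps above, including the cast mentioned at the end, can be sketched as follows. This is only an illustration under stated assumptions: the object name `HeaderlessCsvSketch`, the helper `load`, the four-column subset of the question's schema, and the `yyyy-MM-dd` date format are all hypothetical and need to be adapted to the real file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types.IntegerType
import java.nio.file.Files

object HeaderlessCsvSketch {
  // Assumption: a four-column subset of the question's schema, for brevity.
  val columnNames: Seq[String] =
    Seq("AgreementVersionID", "ProgramID", "AgreementEffectiveDate", "SalesContactName")

  // Hypothetical helper: read a headerless CSV, name the columns, then cast them.
  def load(spark: SparkSession, path: String): DataFrame = {
    val raw = spark.read
      .option("header", "false") // the file has no header row
      .option("delimiter", ",")
      .csv(path)                 // columns arrive as strings named _c0, _c1, ...

    raw.toDF(columnNames: _*)    // one name per column, in file order
      .withColumn("AgreementVersionID", col("AgreementVersionID").cast(IntegerType))
      .withColumn("ProgramID", col("ProgramID").cast(IntegerType))
      // Assumption: dates are written as yyyy-MM-dd; change the pattern to match the data.
      .withColumn("AgreementEffectiveDate", to_date(col("AgreementEffectiveDate"), "yyyy-MM-dd"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("headerless-csv-sketch")
      .getOrCreate()

    // Made-up sample rows standing in for the real file.
    val path = Files.createTempFile("agreements", ".csv")
    Files.write(path, "1,10,2019-07-30,Alice\n2,20,2019-08-01,Bob\n".getBytes("UTF-8"))

    val df = load(spark, path.toString)
    df.printSchema()
    df.createOrReplaceTempView("LicenseAgrmtMaster")
    spark.sql("SELECT * FROM LicenseAgrmtMaster").show()

    spark.stop()
  }
}
```

Reading everything as strings first makes it easier to see which values fail to convert, because a value that cannot be cast simply becomes null in that one column. Note also that setting `header` to true on a file without a header row makes Spark treat the first data row as the header.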