Spark reads a CSV file - column value starts with digits and ends with D / F

Date: 2017-12-01 01:55:55

Tags: csv apache-spark spark-dataframe

I am using Spark to read a CSV file. One of the field values in the CSV is 91520122094491671D. After reading it, the value becomes 9.152012209449166.... I found that this happens whenever a string starts with digits and ends with D or F. But I need to read the data as a string. What should I do?

Here is the CSV file data:

tax_file_code|cus_name|tax_identification_number
T19915201|息烽家吉装饰材料店|91520122094491671D

The Scala code is as follows:

sparkSession.read.format("com.databricks.spark.csv")
  .option("header", "true") 
  .option("inferSchema", true.toString) 
  .load(getHadoopUri(uri)) 
  .createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")

sparkSession.sql(
  s"""
     |  select  cast(tax_file_code as String) as tax_file_code,
     |          cus_name,
     |          cast(tax_identification_number as String) as tax_identification_number
     |  from    t_datacent_cus_temp_guizhou_ds_tmp
  """.stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")

sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show

The result looks like this:

+-----------------+-----------------+-------------------------+
|tax_file_code    |cus_name         |tax_identification_number|
+-----------------+-----------------+-------------------------+
|T19915201        |息烽家吉装饰材料店 |9.152012209449166...     |
+-----------------+-----------------+-------------------------+

2 answers:

Answer 0 (score: 0)

You can try:

sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(20, false)

This sets truncate to false. If truncate is true, strings longer than 20 characters are truncated and all cells are aligned right.

Edit:

val x = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("....src/main/resources/data.csv")

x.printSchema()

x.createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds_tmp")

sparkSession.sql(
  s"""
     |  select  cast(tax_file_code as String) as tax_file_code,
     |          cus_name,
     |          cast(tax_identification_number as String) as tax_identification_number
     |  from    t_datacent_cus_temp_guizhou_ds_tmp
  """.stripMargin).createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")

sparkSession.sql("select * from t_datacent_cus_temp_guizhou_ds").show(truncate = false)

This will output:

+-------------+----------+-------------------------+
|tax_file_code|cus_name  |tax_identification_number|
+-------------+----------+-------------------------+
|T19915201    | 息烽家吉装饰材料店|9.1520122094491664E16    |
+-------------+----------+-------------------------+

Answer 1 (score: 0)

It sounds like the trailing D / F is causing the schema inference to interpret the column as a double or float, and the displayed column is then truncated, which is why you see the exponent notation.

If you want all columns to be read as strings, remove

option("inferSchema", true.toString)
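As a sketch of that fix, reusing the `sparkSession` and `getHadoopUri(uri)` names from the question: with inferSchema omitted (it defaults to false), every column is read as StringType and 91520122094491671D is preserved verbatim. An explicit schema is an alternative that also skips the later casts. This is an untested sketch against the asker's setup, not a verified solution.

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Option 1: omit inferSchema entirely; all columns default to StringType,
// so the trailing D is kept and no precision is lost.
sparkSession.read
  .option("header", "true")
  .csv(getHadoopUri(uri))
  .createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")

// Option 2: declare an explicit schema, making the string types unambiguous
// and removing the need for the cast(...) statements in the SQL above.
val schema = StructType(Seq(
  StructField("tax_file_code", StringType),
  StructField("cus_name", StringType),
  StructField("tax_identification_number", StringType)
))

sparkSession.read
  .option("header", "true")
  .schema(schema)
  .csv(getHadoopUri(uri))
  .createOrReplaceTempView("t_datacent_cus_temp_guizhou_ds")
```

Note that casting to String after reading with inferSchema, as in the question, cannot help: the precision is already lost the moment the value is parsed as a double.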