Question

I'm using Spark and Scala on JDK 1.8. I'm new to Scala.

I'm reading a text file (pat1.txt) that looks like this:

Now I'm reading that file from my Scala code:

val sqlContext = SparkSession.builder().getOrCreate()  
sqlContext.read
    .format(externalEntity.getExtractfileType)
    .option("compression", externalEntity.getCompressionCodec)
    .option("header", if (externalEntity.getHasHeader.toUpperCase == "Y") "true" else "false")
    .option("inferSchema", "true")
    .option("delimiter", externalEntity.getExtractDelimiter)
    .load(externalEntity.getFilePath)
    .createOrReplaceTempView(externalEntity.getExtractName)

Then I run a query from my Scala code:

val queryResult = sqlContext.sql(myQuery)

The output is then written as:

queryResult
 .repartition(LteGenericExtractEntity.getNumberOfFiles.toInt)
 .write.format("csv")
 .option("compression", LteGenericExtractEntity.getCompressionCodec)
 .option("delimiter", LteGenericExtractEntity.getExtractDelimiter)
 .option("header", "true")
 .save(s"${outputDirectory}/${extractFileBase}")

Now, when myQuery above is:

select * from PAT1

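the program produces the output below (note that the extra row containing "value" is not part of the file). Basically, the program does not recognize the comma-separated columns in the input file, and in the output it creates a single column under the header "value". So the output file looks like:

If I change myQuery to: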

select p1.FIRST_NAME, p1.LAST_NAME,p1.HOBBY  from PAT1 p1

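it throws the following exception:

My input can be in any format (for example text or CSV, possibly compressed), and the output is always .csv.

I'm having a hard time understanding how to change the read part so that the created view has the proper columns. Can I get some help with this?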

Answer 1

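This looks like a CSV file with a .txt extension. You can try either of the following:

Treat the file as CSV and read it with spark.read.option("inferSchema", "true").option("header", "true").csv("path/to/file"), adding any other options you need.
Or, after reading the file the way you do now, just specify the dataframe's column names: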

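    // Note: the "text" source always yields a single string column named
    // "value", so toDF with three names would fail; "csv" is used here instead.
    sqlContext.read.format("csv")
          .option("compression", "none")
          .option("delimiter", ",")
          .option("header", "true")
          .load("/tmp/pat1")
          .toDF("first_name", "last_name", "hobby")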
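
To make the difference concrete, here is a minimal, self-contained sketch rather than a definitive fix. It assumes Spark SQL is on the classpath and runs in local mode. The column names FIRST_NAME, LAST_NAME and HOBBY are taken from the query in the question; the data rows, the object name Pat1ViewSketch and the helper sparkFormatFor are made up for illustration. The helper encodes one assumption: that a configured file type of "text" or "txt" really means a delimited file and should be read with the csv source.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object Pat1ViewSketch {
  // Header matches the columns used in the question's query; rows are invented.
  val sampleLines: Seq[String] = Seq(
    "FIRST_NAME,LAST_NAME,HOBBY",
    "Jane,Doe,chess",
    "John,Roe,cycling"
  )

  // Hypothetical helper: map the configured file type to a Spark source name.
  // Assumption: "text"/"txt" inputs are delimited files, so read them as csv.
  def sparkFormatFor(fileType: String): String =
    fileType.trim.toLowerCase match {
      case "text" | "txt" | "csv" => "csv"
      case other                  => other
    }

  // Write the sample to a temporary .txt file and return its path.
  def writeSample(): String = {
    val path = Files.createTempFile("pat1", ".txt")
    Files.write(path, sampleLines.mkString("\n").getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // Same reader options as in the question; only the format varies.
  def read(spark: SparkSession, format: String, path: String): DataFrame =
    spark.read
      .format(format)
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ",")
      .load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, only for this sketch
      .appName("pat1-view-sketch")
      .getOrCreate()
    val path = writeSample()

    // The text source ignores header/delimiter: one string column per line.
    println(read(spark, "text", path).columns.mkString(","))

    // The csv source splits on the delimiter and takes names from the header.
    read(spark, sparkFormatFor("text"), path).createOrReplaceTempView("PAT1")
    spark.sql("select p1.FIRST_NAME, p1.LAST_NAME, p1.HOBBY from PAT1 p1").show()

    spark.stop()
  }
}
```

With the text source the view has only the column "value", which is why select * appears to work while selecting p1.FIRST_NAME fails. Reading the same file through the csv source gives the view the three named columns.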
