Cannot resolve given input columns when running SQL on a DataFrame

Date: 2018-11-06 14:29:55

Tags: scala apache-spark

  • Platform: IntelliJ IDEA 2018.2.4 (Community Edition)
  • SDK: 1.8.0_144
  • OS: Windows 7

As a soon-to-be graduate, I am working on my first big-data task and have run into a problem:

Code

//Loading my csv file here
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter",";")
  .load("/user/sfrtech/dilan/yesterdaycsv.csv")
  .toDF()


//Select required columns
val formatedDf = df.select("`TcRun.ID`", "`Td.Name`", "`TcRun.Startdate`", "`TcRun.EndDate`", "`O.Sim.MsisdnVoice`", "`T.Sim.MsisdnVoice`", "`ErrorCause`")

//Sql on DF in order to get useful data
formatedDf.createOrReplaceTempView("yesterday")
val sqlDF = spark.sql("" +
  " SELECT TcRun.Id, Td.Name, TcRun.Startdate, TcRun.EndDate, SUBSTR(O.Sim.MsisdnVoice,7,14) as MsisdnO, SUBSTR(T.Sim.MsisdnVoice,7,14) as MsisdnT", ErrorCause +
  " FROM yesterday" +
  " WHERE Td.Name like '%RING'" +
  " AND MsisdnO is not null" +
  " AND MsisdnT is not null" +
  " AND ErrorCause = 'NoError'")

Error encountered

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`Td.Name`' given input columns: [TcRun.EndDate, TcRun.Startdate, O.Sim.MsisdnVoice, TcRun.ID, Td.Name, T.Sim.MsisdnVoice, ErrorCause]; line 1 pos 177;

I guess the problem comes from my column names containing ".", but even when I use backticks I can't figure out how to fix it.

Solution

Rename the columns so they no longer contain dots; the plain names then resolve without any quoting:

val newColumns = Seq("id", "name", "startDate", "endDate", "msisdnO", "msisdnT", "error")
val dfRenamed = formatedDf.toDF(newColumns: _*)

dfRenamed.printSchema
// root
// |-- id: string (nullable = false)
// |-- name: string (nullable = false)
// |-- startDate: string (nullable = false)
// |-- endDate: string (nullable = false)
// |-- msisdnO: string (nullable = false)
// |-- msisdnT: string (nullable = false)
// |-- error: string (nullable = false)
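
With dot-free names in place, the original query can be written without any quoting gymnastics. A minimal sketch, assuming the dfRenamed frame above; the SUBSTR and filter logic mirrors the question's query:

// Hypothetical follow-up: re-run the question's query against the renamed columns
dfRenamed.createOrReplaceTempView("yesterday")

val sqlDF = spark.sql(
  """SELECT id, name, startDate, endDate,
    |       SUBSTR(msisdnO, 7, 14) AS msisdnO,
    |       SUBSTR(msisdnT, 7, 14) AS msisdnT,
    |       error
    |FROM yesterday
    |WHERE name LIKE '%RING'
    |  AND msisdnO IS NOT NULL
    |  AND msisdnT IS NOT NULL
    |  AND error = 'NoError'""".stripMargin)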

3 answers:

Answer 0 (score: 2)

This works:

val sqlDF = spark.sql("" +
  " SELECT `TcRun.Id`, `Td.Name`, `TcRun.Startdate`, `TcRun.EndDate`, ErrorCause" +
  " FROM yesterday" +
  " WHERE `Td.Name` like '%RING'" +
  " AND `O.Sim.MsisdnVoice` is not null" +
  " AND `T.Sim.MsisdnVoice` is not null" +
  " AND ErrorCause = 'NoError'")

When a field name contains the . character, wrap it in backticks in the select clause; otherwise Spark parses the dot as a table or struct qualifier.
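
For reference, here is a minimal, self-contained sketch of backtick quoting; the data and session setup are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A tiny frame whose column names contain literal dots
val demo = Seq(("CALL_RING", "1"), ("SMS", "2")).toDF("Td.Name", "TcRun.ID")
demo.createOrReplaceTempView("demo")

// Without backticks, Spark reads Td.Name as column "Name" of a table or struct "Td";
// backticks make it a single column identifier.
spark.sql("SELECT `Td.Name` FROM demo WHERE `Td.Name` LIKE '%RING'").show()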

Answer 1 (score: 0)

// Define column names for the csv without "." (dot-free names as in the accepted solution)
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true),
  StructField("startDate", StringType, true),
  StructField("endDate", StringType, true),
  StructField("msisdnO", StringType, true),
  StructField("msisdnT", StringType, true),
  StructField("error", StringType, true)))

// Load csv file without headers and specify your schema
val df = spark.read
  .format("csv")
  .option("header", "false")
  .option("delimiter",";")
  .schema(schema)
  .load("/user/sfrtech/dilan/yesterdaycsv.csv")
  .toDF()

Then select the columns you need:

import spark.implicits._  // needed for the $"col" syntax

df.select($"id", $"name" /* etc. etc. */)

Answer 2 (score: 0)

For column names that contain a . (dot), you can wrap the column name in backtick (`) characters:

df.select("`Td.Name`")

I ran into a similar problem, and this solution worked for me.

Reference: DataFrame columns names conflict with .(dot)