Handling column names that do not exist in a Spark DataFrame

Time: 2016-05-04 23:09:27

Tags: scala apache-spark-sql

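I am currently using Spark Streaming and receiving JSON data from Kafka. I convert my RDD to a DataFrame and register it as a table. After that, when I run a query that references a column name that does not exist in the DataFrame, it throws an error like: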
"'No such struct field currency in price, recipientId;'"

Here is my query:
val selectQuery = "lower(serials.brand) as brandname, lower(appname) as appname, lower(serials.pack) as packname, lower(serials.asset) as assetname, date_format(eventtime, 'yyyy-MM-dd HH:00:00') as eventtime, lower(eventname) as eventname, lower(client.OSName) as platform, lower(eventorigin) as eventorigin, meta.price as price, client.ip as ip, lower(meta.currency) as currency, cast(meta.total as int) as count"

Here is my dataframe
DataFrame[addedTime: bigint, appName: string, client: struct<ip:string>, eventName: string, eventOrigin: string, eventTime: string, geoLocation: string, location: string, meta: struct<period:string,total:string>, serials: struct<asset:string,brand:string,pack:string>, userId: string]

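My JSON is not strict, and some keys may be missing from time to time. If a key or column is not present in the DataFrame, how can I safely avoid this exception?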

2 Answers:

Answer 0 (score: 2)

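You can use df.columns to check which columns exist. There are several ways to get the column names and data types, for example df.schema. You can also print the schema with df.printSchema().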
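Note that df.columns lists only top-level columns, so a nested field such as meta.currency has to be looked up in df.schema. A minimal sketch of that check follows. The names SchemaPaths, hasPath and safeExpr are hypothetical helpers, not part of Spark. Comparing names case-insensitively and substituting a string-typed NULL for a missing field are both assumptions (the query in the question refers to appName as appname, which suggests case-insensitive resolution).

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper object; not part of the Spark API.
object SchemaPaths {
  // Returns true when a dotted path such as "meta.currency" resolves in the schema.
  // Names are compared case-insensitively (an assumption about the session settings).
  def hasPath(schema: StructType, path: String): Boolean = {
    def loop(current: StructType, parts: List[String]): Boolean = parts match {
      case Nil => true
      case head :: tail =>
        current.fields.find(_.name.equalsIgnoreCase(head)) match {
          case Some(StructField(_, nested: StructType, _, _)) => loop(nested, tail)
          case Some(_) => tail.isEmpty // a leaf field cannot have children
          case None => false
        }
    }
    loop(schema, path.split('.').toList)
  }

  // Builds "expr as alias" when the path exists, otherwise a NULL placeholder
  // (typed as string here, which is an assumption) under the same alias.
  def safeExpr(schema: StructType, path: String, expr: String, alias: String): String =
    if (hasPath(schema, path)) s"$expr as $alias"
    else s"cast(null as string) as $alias"
}

// Possible usage with the DataFrame from the question (df is assumed to exist):
//   val exprs = Seq(
//     SchemaPaths.safeExpr(df.schema, "meta.price", "meta.price", "price"),
//     SchemaPaths.safeExpr(df.schema, "meta.currency", "lower(meta.currency)", "currency"))
//   val result = df.selectExpr(exprs: _*)
```

With this approach the select list is rebuilt for each batch from that batch's schema, so a missing key yields a NULL column instead of an analysis error.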

Answer 1 (score: 0)

The only way I found is to define a schema for your JSON and then use that schema to parse the JSON into a DataFrame:

val df = sqlcontext.read.schema(schema).json(rdd)
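For illustration, the schema passed to the reader could be declared roughly as below. When an explicit schema is supplied, keys that are absent from a record come back as null, so the query no longer fails on them. This is only a sketch: the object name EventSchema and the helper strings are hypothetical, the field list is inferred from the query and the DataFrame in the question, and every field type (including meta.price as a string) is an assumption. json(RDD[String]) is the Spark 1.x-style reader call used in the answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema holder; field names come from the question, types are assumed.
object EventSchema {
  // Builds a struct whose fields are all nullable strings.
  private def strings(names: String*): StructType =
    StructType(names.map(n => StructField(n, StringType, nullable = true)))

  val schema: StructType = StructType(Seq(
    StructField("appName", StringType, nullable = true),
    StructField("eventName", StringType, nullable = true),
    StructField("eventOrigin", StringType, nullable = true),
    StructField("eventTime", StringType, nullable = true),
    StructField("client", strings("ip", "OSName"), nullable = true),
    // price and currency are declared even though some records omit them
    StructField("meta", strings("price", "currency", "total", "period"), nullable = true),
    StructField("serials", strings("asset", "brand", "pack"), nullable = true)
  ))

  // Parses one batch of JSON strings using the fixed schema instead of inference.
  def parse(sqlContext: SQLContext, rdd: RDD[String]): DataFrame =
    sqlContext.read.schema(schema).json(rdd)
}
```

A fixed schema also skips the inference pass over each batch, at the cost of having to update the declaration when the JSON gains new keys.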