Using sqlContext.sql(...)

Time: 2019-06-27 11:51:07

Tags: json apache-spark hive pyspark hive-serde

I have a pyspark script in a Zeppelin notebook that points at a JSON file in blob storage in order to infer the JSON schema and create an external table in Hive.

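I can print the SQL command from the script and execute it in a separate paragraph, and the table is created fine, but when I try to create the table through the sqlContext.sql() method I get the following error: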

AnalysisException: u'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found);'

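Googling this error only turns up pages saying to make sure the SerDe's JAR file is on the server, which it clearly is, since I can create this table manually. Below is my script: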

%spark2.pyspark

import os
import datetime as dt
import time
from datetime import date
from pyspark.sql.functions import monotonically_increasing_id, lit
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql.functions import split, lower, unix_timestamp, from_unixtime

hiveDbName = 'dev_phoenix'
hiveTableName = 'et_engagement_cac'
serdeName = 'org.openx.data.jsonserde.JsonSerDe'
jsonFileLocation = 'wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement'

jsonDf = sqlContext.read.json("wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement/Engagement.json")

# jsonDf.printSchema()

extTableDDL = "create external table " + hiveDbName + "." + hiveTableName + "(\n"

for col in jsonDf.dtypes:
    extTableDDL += '`' + col[0] + '` ' + col[1].replace('_id','`_id`') + ',\n'

extTableDDL = extTableDDL[:-2]
extTableDDL += ')\nrow format serde \'' + serdeName + '\'\n'
extTableDDL += 'location \'' + jsonFileLocation + '\'\n'
extTableDDL += 'tblproperties (\'serialization.null.format\'=\'\')'

print(extTableDDL)

sqlContext.sql(extTableDDL)

I have deliberately obfuscated our WASB container name, hence the blah-blah-blah and meh-meh-meh placeholders.

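I have found some posts that make me think there are limits on the kinds of tables that can be created with sqlContext.sql, so perhaps what I am trying to do is not possible?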

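When I take out the SerDe declaration I am able to create the table successfully, but Hive then uses its default SerDe, which does not work for the data in the underlying files.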
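For reference, the DDL assembly in the question's script can be reproduced without a cluster, which makes it easier to inspect the exact statement passed to sqlContext.sql(). The following is a minimal sketch under stated assumptions: `build_ext_table_ddl` is a hypothetical helper name, `sample_dtypes` is an invented two-column schema standing in for `jsonDf.dtypes`, and the `_id` backtick-quoting step from the script is left out for brevity.

```python
def build_ext_table_ddl(db, table, dtypes, serde, location):
    """Assemble a CREATE EXTERNAL TABLE statement from (name, type) pairs."""
    # Backtick-quote each column name; the type string is passed through unchanged.
    columns = ",\n".join("`%s` %s" % (name, dtype) for name, dtype in dtypes)
    return (
        "create external table %s.%s(\n%s)\n"
        "row format serde '%s'\n"
        "location '%s'\n"
        "tblproperties ('serialization.null.format'='')"
    ) % (db, table, columns, serde, location)


# Hypothetical schema standing in for jsonDf.dtypes (an assumption, not real data).
sample_dtypes = [("id", "string"), ("score", "bigint")]

ddl = build_ext_table_ddl(
    "dev_phoenix",
    "et_engagement_cac",
    sample_dtypes,
    "org.openx.data.jsonserde.JsonSerDe",
    "wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement",
)
print(ddl)
```

Joining the column definitions avoids the need to slice off the trailing comma and newline after the loop.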

1 answer:

Answer 0 (score: 0)

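OK, so I think I have worked out what was happening and how to fix it. I suspect the JAR file for the SerDe I want to use sits in a directory on the server that is not on the classpath.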

So I added an initial call to spark.sql(...) to add the JAR, and now it works. See the updated script below:

%spark2.pyspark

import os
import datetime as dt
import time
from datetime import date
from pyspark.sql.functions import monotonically_increasing_id, lit
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql.functions import split, lower, unix_timestamp, from_unixtime

hiveDbName = 'dev_phoenix'
hiveTableName = 'et_engagement_cac'
serdeName = 'org.openx.data.jsonserde.JsonSerDe'
jsonFileLocation = 'wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement'

jsonDf = spark.read.json("wasbs://blah-blah-blah@meh-meh-meh.blob.core.windows.net/dev/data/Engagement/Engagement.json")

# jsonDf.printSchema()

spark.sql('add jar /usr/hdp/current/hive-client/lib/json-serde-1.3.8-jar-with-dependencies.jar')

extTableDDL = "create external table " + hiveDbName + "." + hiveTableName + "(\n"

for col in jsonDf.dtypes:
    extTableDDL += '`' + col[0] + '` ' + col[1].replace('_id','`_id`').replace('_class','`_class`') + ',\n'

extTableDDL = extTableDDL[:-2]
extTableDDL += ')\nROW FORMAT SERDE\n'
extTableDDL += '   \'' + serdeName + '\'\n'
extTableDDL += 'STORED AS INPUTFORMAT\n'
extTableDDL += '   \'org.apache.hadoop.mapred.TextInputFormat\'\n'
extTableDDL += 'OUTPUTFORMAT\n'
extTableDDL += '   \'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat\'\n'

extTableDDL += 'location \'' + jsonFileLocation + '\'\n'
extTableDDL += 'tblproperties (\'serialization.null.format\'=\'\')'

print(extTableDDL)

spark.sql(extTableDDL)
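The ordering the answer relies on (register the JAR in the session first, then run the DDL) can be isolated in a small function. This is only a sketch: `create_table_with_serde` is a hypothetical helper, and the list `executed` stands in for `spark.sql` so the order of statements can be checked without a Spark session. The DDL string here is a shortened placeholder, not a complete statement.

```python
def create_table_with_serde(run_sql, jar_path, ddl):
    """Add the SerDe JAR to the session, then run the table DDL."""
    # ADD JAR is issued first; the answer reports that the DDL then succeeds.
    run_sql("add jar " + jar_path)
    run_sql(ddl)


# Stand-in for spark.sql that simply records each statement it receives.
executed = []

create_table_with_serde(
    executed.append,
    "/usr/hdp/current/hive-client/lib/json-serde-1.3.8-jar-with-dependencies.jar",
    "create external table dev_phoenix.et_engagement_cac(`id` string)",  # placeholder DDL
)

for statement in executed:
    print(statement)
```

In the notebook, the same helper would be called as `create_table_with_serde(spark.sql, jar_path, extTableDDL)`, assuming the `spark` session that Zeppelin provides.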