I want to load an ORC file from the pyspark shell to create a Hadoop RDD.
I am using the following commands in the pyspark shell:
>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster("local").setAppName("My App")
>>> sc = SparkContext(conf = conf)
>>> lines = sc.newAPIHadoopFile("/tmp/orcfile",
...                             "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
...                             "org.apache.hadoop.io.Text",
...                             "org.apache.hadoop.io.LongWritable")
I get the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat cannot be cast to org.apache.hadoop.mapreduce.InputFormat
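I suspect the cast fails because org.apache.hadoop.hive.ql.io.orc.OrcInputFormat implements the old org.apache.hadoop.mapred.InputFormat API, while newAPIHadoopFile expects an org.apache.hadoop.mapreduce.InputFormat. If that is right, the old-API sc.hadoopFile should be the matching call; below is a sketch under that assumption (NullWritable/OrcStruct are my guess at the key/value classes this format emits):
>>> lines = sc.hadoopFile("/tmp/orcfile",
...                       "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
...                       "org.apache.hadoop.io.NullWritable",
...                       "org.apache.hadoop.hive.ql.io.orc.OrcStruct")
>>> lines.first()
Alternatively, if a new-API variant such as org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat is on the classpath, the original newAPIHadoopFile call might work with that class and the same key/value classes, but I have not confirmed this.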
Can anyone tell me the correct way to do this for ORC and sequence files?
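For sequence files at least, PySpark has a dedicated sc.sequenceFile method, so I would expect something like the following to work (/tmp/seqfile and the Text/LongWritable classes are placeholders for an actual file and its key/value types):
>>> pairs = sc.sequenceFile("/tmp/seqfile",
...                         "org.apache.hadoop.io.Text",
...                         "org.apache.hadoop.io.LongWritable")
>>> pairs.first()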
I am able to create a DataFrame from a HiveContext:
>>> from pyspark.sql import HiveContext
>>> orcfile = "/tmp/orcfile"
>>> sqlContext = HiveContext(sc)
>>> read_orclines = sqlContext.read.format("orc").load(orcfile)
>>> read_orclines.first()
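Since what I ultimately want is an RDD, I assume I can also fall back on the DataFrame's rdd attribute, which exposes the loaded data as an RDD of Row objects:
>>> orc_rdd = read_orclines.rdd
>>> orc_rdd.first()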