We keep huge legacy files on a Hadoop cluster in compressed sequence file format. The sequence files were created by a Hive ETL. Say I created the Hive table with the following DDL:
CREATE TABLE sequence_table(
col1 string,
col2 int)
stored as sequencefile;
And here is the script used to load the sequence table above:
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert into table sequence_table
select * from text_table;
We have since exported the data under the sequence file location to S3 for archiving, and I am now trying to process those files with Spark on AWS EMR. How do I read a sequence file in Spark? I looked at a sample file; it has a header like the one below, from which I know the <K,V> of this sequence file is <BytesWritable,Text>:
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)
org.apache.hadoop.io.compress.SnappyCodec
Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E £ =£
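(For reference, a minimal sketch of how the key/value classes could also be checked programmatically with SequenceFile.Reader; the local path here is just a hypothetical copy of one part file:)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Hypothetical local copy of one part file, used only to inspect the header
val reader = new SequenceFile.Reader(
  new Configuration(),
  SequenceFile.Reader.file(new Path("/tmp/part-00000")))
println(reader.getKeyClassName)   // org.apache.hadoop.io.BytesWritable
println(reader.getValueClassName) // org.apache.hadoop.io.Text
reader.close()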
I tried it this way:
val file = sc.sequenceFile(
"s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
classOf[BytesWritable],
classOf[Text])
file.take(10)
but it produces this error:
18/03/09 16:48:07 ERROR TaskSetManager: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.BytesWritable, value: )
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (,8255909859607837R188956557001505628000150562818115056280001505628181TimerRecord9558SRM-1528454-0PiYo Workout!FRFM1810000002017-09-17 01:29:29))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1); not retrying
18/03/09 16:48:07 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
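From what I understand, the problem is that BytesWritable (and Text) are not Java-serializable, so take() cannot ship them back to the driver. A common workaround seems to be converting the Writables to plain Scala types on the executors before collecting anything; a minimal, untested sketch (decoding the key bytes as UTF-8 is my own assumption):

import org.apache.hadoop.io.{BytesWritable, Text}

val rdd = sc.sequenceFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[BytesWritable],
  classOf[Text])
  // convert to serializable Scala types before anything leaves the executors
  .map { case (k, v) => (new String(k.copyBytes(), "UTF-8"), v.toString) }
rdd.take(10)

I have not verified that this is the right approach for sequence files written by Hive.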
I also tried the new Hadoop API instead, but still no luck:
scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
<console>:31: error: class SequenceFileInputFormat takes type parameters
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
^
scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
<console>:1: error: identifier expected but ']' found.
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
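The last error looks like a pure Scala syntax issue: presumably the type parameters need square brackets rather than Java-style angle brackets, along the lines of the untested sketch below (which I expect would still hit the same serializability problem when collecting):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val data = sc.newAPIHadoopFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[SequenceFileInputFormat[BytesWritable, Text]],
  classOf[BytesWritable],
  classOf[Text],
  conf)

What is the correct way to read these Snappy-compressed sequence files in Spark?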