We keep huge legacy files on a Hadoop cluster in compressed sequence file format. The sequence files were created by a Hive ETL. Say I created the Hive table with the following DDL:
CREATE TABLE sequence_table(
col1 string,
col2 int)
stored as sequencefile;
And here is the script used to load the sequence table above:
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
insert into table sequence_table
select * from text_table;
We have since exported the data under the sequence file location to S3 for archiving, and I am now trying to process those files with Spark on AWS EMR. How do I read a sequence file in Spark? I looked at a sample file; it has a header like the one below, from which I know the <K,V> of this sequence file is <BytesWritable,Text>:
SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text)
org.apache.hadoop.io.compress.SnappyCodec
Ì(KÇØ»Ô:˜t£¾äIrlÿÿÿÿÌ(KÇØ»Ô:˜t£¾äIrlŽ£E £ =£
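(For reference, a minimal sketch of how the key/value classes could also be checked programmatically with SequenceFile.Reader; the local path here is just a hypothetical copy of one part file:)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.SequenceFile

// Hypothetical local copy of one part file, used only to inspect the header
val reader = new SequenceFile.Reader(
  new Configuration(),
  SequenceFile.Reader.file(new Path("/tmp/part-00000")))
println(reader.getKeyClassName)   // org.apache.hadoop.io.BytesWritable
println(reader.getValueClassName) // org.apache.hadoop.io.Text
reader.close()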
I tried it this way:
val file = sc.sequenceFile(
"s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
classOf[BytesWritable],
classOf[Text])
file.take(10)
but it produces this error:
18/03/09 16:48:07 ERROR TaskSetManager: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
Serialization stack:
- object not serializable (class: org.apache.hadoop.io.BytesWritable, value: )
- field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (,8255909859607837R188956557001505628000150562818115056280001505628181TimerRecord9558SRM-1528454-0PiYo Workout!FRFM1810000002017-09-17 01:29:29))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1); not retrying
18/03/09 16:48:07 WARN ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.hadoop.io.BytesWritable
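From what I understand, the problem is that BytesWritable (and Text) are not Java-serializable, so take() cannot ship them back to the driver. A common workaround seems to be converting the Writables to plain Scala types on the executors before collecting anything; a minimal, untested sketch (decoding the key bytes as UTF-8 is my own assumption):

import org.apache.hadoop.io.{BytesWritable, Text}

val rdd = sc.sequenceFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[BytesWritable],
  classOf[Text])
  // convert to serializable Scala types before anything leaves the executors
  .map { case (k, v) => (new String(k.copyBytes(), "UTF-8"), v.toString) }
rdd.take(10)

I have not verified that this is the right approach for sequence files written by Hive.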
I also tried the new Hadoop API instead, but still no luck:
scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
<console>:31: error: class SequenceFileInputFormat takes type parameters
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat],classOf[BytesWritable],classOf[Text],conf)
^
scala> val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
<console>:1: error: identifier expected but ']' found.
val data = sc.newAPIHadoopFile("s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",classOf[SequenceFileInputFormat<BytesWritable,Text>],classOf[BytesWritable],classOf[Text],conf)
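The last error looks like a pure Scala syntax issue: presumably the type parameters need square brackets rather than Java-style angle brackets, along the lines of the untested sketch below (which I expect would still hit the same serializability problem when collecting):

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val data = sc.newAPIHadoopFile(
  "s3://viewershipforneo4j/viewership/event_date=2017-09-17/*",
  classOf[SequenceFileInputFormat[BytesWritable, Text]],
  classOf[BytesWritable],
  classOf[Text],
  conf)

What is the correct way to read these Snappy-compressed sequence files in Spark?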