Spark com.databricks.spark.csv cannot read node-snappy compressed files

Date: 2016-09-24 22:24:09

Tags: apache-spark pyspark snappy databricks apache-spark-2.0

I have some CSV files on S3 that were compressed with the snappy compression algorithm (using the node-snappy package). I'd like to process these files in Spark with com.databricks.spark.csv, but I keep getting an invalid input error.

Code:

file_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true', codec='snappy', mode='FAILFAST').load('s3://sample.csv.snappy')

Error message:

16/09/24 21:57:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-0-32-5.ec2.internal): java.lang.InternalError: Could not decompress data. Input is invalid.
    at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapred.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:208)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:246)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:255)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:209)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1305)
    at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1305)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

1 Answer:

Answer 0 (score: 0):

This looks like the same issue answered here - basically, python snappy (and likewise node-snappy) is not compatible with Hadoop snappy: they produce raw Snappy output, while Hadoop's SnappyCodec expects its own block-framed format, so the decompressor rejects the input as invalid.