Spark: Exception reading an S3 file with Spark 1.5.2 pre-built for Hadoop 2.6

Date: 2016-01-27 19:38:08

Tags: scala amazon-s3 apache-spark

I am trying to read an existing file on S3 from an application based on Spark 1.5.2 pre-built for Hadoop 2.6. This is my snippet:

  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "MYKEY")
  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "MYSECRET")
  val a = sc.textFile("s3://myBucket/TNRealtime/output/2016/01/27/22/45/00/a.txt").map{line => line.split(",")}
  val b = a.collect // **ERROR** producing statement

I get an exception. Strangely, when I try the same snippet in another context, I get a different error:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://snapdeal-personalization-dev-us-west-2/TNRealtime/output/2016/01/27/22/45/00/a.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
    at com.snapdeal.pears.trending.TrendingDecay$.load(TrendingDecay.scala:68)

Can anyone help me understand this problem?

2 Answers:

Answer 0 (score: 1)

I'm not sure what your exact scenario is, but when I run Spark locally and want to access files on S3, I put the access key and secret key directly in the s3 path, like this:

sc.textFile("s3://MYKEY:MYSECRET@myBucket/TNRealtime/output/2016/01/27/22/45/00/a.txt")

Maybe this will work for you too.
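One caveat with embedding credentials in the URI: AWS secret keys often contain characters such as / or +, which can break URI parsing. Below is a minimal sketch of URL-encoding the secret before building the path; the credential values are placeholders, and this is only an illustration of the same credentials-in-URI approach, not a guaranteed fix:

    import java.net.URLEncoder

    // Placeholder credentials; replace with your own.
    // Encoding is only needed because secrets may contain URI-unsafe characters like '/' or '+'.
    val accessKey = "MYKEY"
    val rawSecret = "MY/SECRET+KEY"
    val encodedSecret = URLEncoder.encode(rawSecret, "UTF-8") // '/' -> %2F, '+' -> %2B

    val path = s"s3://$accessKey:$encodedSecret@myBucket/TNRealtime/output/2016/01/27/22/45/00/a.txt"
    val lines = sc.textFile(path)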

Answer 1 (score: 1)

Try replacing s3 with s3n, the newer (S3 native) filesystem scheme.
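For reference, a minimal sketch of what the question's snippet could look like with the s3n scheme. Note that s3n reads credentials from the fs.s3n.* properties rather than the fs.s3.* keys used in the question; the bucket path and key placeholders are taken from the question itself:

    // s3n:// uses its own credential properties (fs.s3n.*),
    // so set those instead of the fs.s3.* keys.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "MYKEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "MYSECRET")

    val a = sc.textFile("s3n://myBucket/TNRealtime/output/2016/01/27/22/45/00/a.txt")
      .map(line => line.split(","))
    val b = a.collect()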