I'm using pyspark (Spark 1.0.1) in IPython to load a gzipped file whose name contains colons. I can load the file if I rename it, but otherwise I get an error.
The commands are:
inputFile = '/vol/data/standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
input = sc.textFile(inputFile).map(loadRecord)
input.count()
I get the following traceback:
Py4JJavaError Traceback (most recent call last)
<ipython-input-69-3f1537c7b8bd> in <module>()
----> 1 input.count()
/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in count(self)
706 3
707 """
--> 708 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
709
710 def stats(self):
/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in sum(self)
697 6.0
698 """
--> 699 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
700
701 def count(self):
/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in reduce(self, f)
617 if acc is not None:
618 yield acc
--> 619 vals = self.mapPartitions(func).collect()
620 return reduce(f, vals)
621
/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in collect(self)
581 """
582 with _JavaStackTrace(self.context) as st:
--> 583 bytesInJava = self._jrdd.collect().iterator()
584 return list(self._collect_iterator_through_file(bytesInJava))
585
/vol/code/spark/spark-1.0.1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
535 answer = self.gateway_client.send_command(command)
536 return_value = get_return_value(answer, self.gateway_client,
--> 537 self.target_id, self.name)
538
539 for temp_arg in temp_args:
and the error is:
Py4JJavaError: An error occurred while calling o206.collect.
: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz
at org.apache.hadoop.fs.Path.initialize(Path.java:148)
at org.apache.hadoop.fs.Path.<init>(Path.java:126)
at org.apache.hadoop.fs.Path.<init>(Path.java:50)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1038)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
How can I load this file without renaming it? I can't rename every file that I want to process with Spark.
Answer 0 (score: 1):
The Hadoop path parser interprets : as a protocol separator. The solution is to specify the protocol explicitly. For a local file:
inputFile = 'file:///vol/data/standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
input = sc.textFile(inputFile).map(loadRecord)
input.count()
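Note that loadRecord is the poster's own parsing function and is not shown in the question; as a purely hypothetical sketch, it could be a simple CSV line parser like this:

import csv

def loadRecord(line):
    # Hypothetical helper: parse one CSV line into a list of fields
    return next(csv.reader([line]))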
If you want to load multiple files, use a wildcard: /vol/data/standard_feed*
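For example, a minimal sketch combining the wildcard with the explicit file:// protocol (reusing loadRecord from the question):

inputFiles = 'file:///vol/data/standard_feed*'
allRecords = sc.textFile(inputFiles).map(loadRecord)
allRecords.count()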
If you want to use Spark for distributed computation, you need to copy the files to a distributed file system (HDFS, S3, GCS, etc.). The path then becomes, e.g., s3n://my-bucket/some-directory/* (for S3).
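A minimal sketch of the S3 variant; the bucket name is the answer's placeholder, and the credential setup (the fs.s3n.* Hadoop configuration keys, reached through the non-public _jsc handle) is an assumption about your environment:

# Hypothetical credentials; fs.s3n.* are the Hadoop configuration keys for the s3n connector
sc._jsc.hadoopConfiguration().set('fs.s3n.awsAccessKeyId', 'YOUR_ACCESS_KEY')
sc._jsc.hadoopConfiguration().set('fs.s3n.awsSecretAccessKey', 'YOUR_SECRET_KEY')
s3Records = sc.textFile('s3n://my-bucket/some-directory/*').map(loadRecord)
s3Records.count()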