Error trying to load a file with colons in its name using sc.textFile

Asked: 2014-08-13 19:58:24

Tags: apache-spark

I'm using pyspark (Spark 1.0.1) in IPython to load a gzipped file that has colons in its name. I can load the file after renaming it, but otherwise I get an error.

The commands are:

inputFile = '/vol/data/standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
input = sc.textFile(inputFile).map(loadRecord)
input.count()

I get the following traceback:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-69-3f1537c7b8bd> in <module>()
----> 1 input.count()

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in count(self)
    706         3
    707         """
--> 708         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
    709
    710     def stats(self):

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in sum(self)
    697         6.0
    698         """
--> 699         return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
    700
    701     def count(self):

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in reduce(self, f)
    617             if acc is not None:
    618                 yield acc
--> 619         vals = self.mapPartitions(func).collect()
    620         return reduce(f, vals)
    621

/vol/code/spark/spark-1.0.1/python/pyspark/rdd.pyc in collect(self)
    581         """
    582         with _JavaStackTrace(self.context) as st:
--> 583           bytesInJava = self._jrdd.collect().iterator()
    584         return list(self._collect_iterator_through_file(bytesInJava))
    585

/vol/code/spark/spark-1.0.1/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537                 self.target_id, self.name)
    538
    539         for temp_arg in temp_args:

and the error is:

    Py4JJavaError: An error occurred while calling o206.collect.
    : java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz
    at org.apache.hadoop.fs.Path.initialize(Path.java:148)
    at org.apache.hadoop.fs.Path.<init>(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:50)
    at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1038)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)

How can I load this file without renaming it? I can't rename every file I want to process with Spark.

1 Answer:

Answer 0 (score: 1)

The Hadoop path parser interprets : as a scheme separator, so it tries to treat everything before the first colon in the file name as a URI scheme, which is what produces the "Relative path in absolute URI" error. The workaround is to specify the scheme explicitly. For a local file:

inputFile = 'file:///vol/data/standard_feed:2014_08_13_15:20140813180721:1:2:92db249b89dbfb8dbad5c5fb0b3b79af.csv.gz'
input = sc.textFile(inputFile).map(loadRecord)
input.count()

If you want to load multiple files, use a glob pattern such as /vol/data/standard_feed*, as sketched below.
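For example, a minimal sketch along the same lines (keeping the explicit file:// scheme from above and assuming the same loadRecord parser from the question):

# load every matching file into one RDD; the glob is expanded by Hadoop
feedFiles = 'file:///vol/data/standard_feed*'
feed = sc.textFile(feedFiles).map(loadRecord)
feed.count()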

If you want to use Spark for distributed computation, you'll need to copy the files onto a distributed file system (HDFS, S3, GCS, etc.). For S3, the path then becomes something like s3n://my-bucket/some-directory/*.
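For example, a minimal sketch (my-bucket and some-directory are placeholders, and the s3n:// scheme assumes your AWS credentials are already set in the Hadoop configuration):

# read every object under the prefix as one RDD of text lines
s3Files = 's3n://my-bucket/some-directory/*'
records = sc.textFile(s3Files).map(loadRecord)
records.count()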