java.io.IOException: Incomplete HDFS URI, no host: on AWS

Date: 2017-03-09 20:13:57

Tags: python apache-spark amazon-ec2 hdfs pyspark

This is my first time using AWS, and my files are stored there. Here is what I have so far for reading the file:

artist_data = sc.textFile('hdfs:///<aws_server>:<port>/home/ubuntu/artist_stuff/_artist_data')

I have also tried:

artist_data = sc.textFile('hdfs:////home/ubuntu/artist_stuff/_artist_data')
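For reference, this is my understanding of the URI forms that should be accepted (just a sketch; namenode-host:8020 below is a placeholder for the actual NameNode address, not something from my setup):

# Fully qualified URI: host:port must come directly after the two scheme
# slashes; 'hdfs:///...' leaves the authority empty, hence "no host".
artist_data = sc.textFile('hdfs://namenode-host:8020/home/ubuntu/artist_stuff/_artist_data')

# Scheme-less path: resolved against fs.defaultFS from core-site.xml.
artist_data = sc.textFile('/home/ubuntu/artist_stuff/_artist_data')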

Then I built my RDD:

artist_data = artist_data.map(lambda line: line.encode("ascii", "ignore").strip().split()).filter(lambda line: len(line) > 1)
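To check that the transformation itself is not the problem, here is what it does to one made-up sample line (Python 3, where str.encode() returns bytes):

sample = '1240105\tAndré Visior'
print(sample.encode("ascii", "ignore").strip().split())
# [b'1240105', b'Andr', b'Visior']  -- bytes fields; non-ASCII characters dropped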

I get this error every time I run artist_data.collect().

When I simply tried sc.textFile("file:///home/ubuntu/artist_stuff/_artist_data"), I got a different error: InvalidInputException: Input path does not exist: file:/home/ubuntu/Assignment_2/_artist_data. I'm guessing that error comes from partitioning or something similar, which is why I chose to write the path as hdfs:///.
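In case it matters, one workaround I have seen suggested when the file only exists on the driver machine is to read it locally and parallelize it (a sketch; only reasonable for small files):

with open('/home/ubuntu/artist_stuff/_artist_data') as f:
    artist_data = sc.parallelize(f.readlines())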

Here is the full error log:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>()

----> 1 artist_data.collect()

/home/ubuntu/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py in collect(self)

    774         """
    775         with SCCallSiteSync(self.context) as css:
--> 776             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    777         return list(_load_from_socket(port, self._jrdd_deserializer))
    778

/home/ubuntu/anaconda3/lib/python3.5/site-packages/py4j/java_gateway.py in __call__(self, *args)

  1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)

    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: Incomplete HDFS URI, no host: hdfs:/home/ubuntu/artist_stuff/_artist_data
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)

0 Answers:

No answers yet.