Question

I am using AWS for the first time and have stored my file on AWS. This is what I have used so far to read the file:

artist_data = sc.textFile('hdfs:///<aws_server>:<port>/home/ubuntu/artist_stuff/_artist_data')

I also tried:

artist_data = sc.textFile('hdfs:////home/ubuntu/artist_stuff/_artist_data')

Then I build my RDD:

artist_data = artist_data.map(lambda line:line.encode("ascii", "ignore").strip().split()).filter(lambda line: len(line) > 1)

Every time I run artist_data.collect(), I get the error shown in the log further down.

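When I tried just sc.textFile("file:///home/ubuntu/artist_stuff/_artist_data"), I got a different error: InvalidInputException: Input path does not exist: file:/home/ubuntu/Assignment_2/_artist_data. I guessed that this was caused by partitioning or something similar, so I switched to the hdfs:/// form instead.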
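The "Input path does not exist" error for a file:// URI can be checked from the driver before Spark is involved. The following is a minimal sketch, not a definitive fix: `local_file_uri` is a hypothetical helper introduced here for illustration, and the temporary file merely stands in for the real data file. It resolves a path, confirms that the file exists, and builds the file:// URI with `pathlib`.

```python
import os
import tempfile
from pathlib import Path


def local_file_uri(path):
    """Return a file:// URI for an existing local file.

    Hypothetical helper (not part of Spark): raises FileNotFoundError
    early, on the driver, instead of failing later inside collect().
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Input path does not exist: {resolved}")
    return resolved.as_uri()


# Demonstration with a throwaway file standing in for the real data file.
with tempfile.NamedTemporaryFile(suffix="_artist_data", delete=False) as tmp:
    tmp.write(b"1 example_artist\n")

demo_uri = local_file_uri(tmp.name)
print(demo_uri)  # a URI of the form file:///...
os.remove(tmp.name)
```

The returned string could then be passed to sc.textFile(...). This check only covers the machine running the driver: on a cluster, a file:// path must also be readable at the same location on every worker node. Note also that the path in the error message (Assignment_2) differs from the one in the call (artist_stuff), so it may be worth confirming which directory actually holds the file.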

Here is the full error log:

Py4JJavaError Traceback (most recent call last) in ()

----> 1 artist_data.collect()

/home/ubuntu/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py in collect(self)

774         """
775         with SCCallSiteSync(self.context) as css:

--> 776             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
777         return list(_load_from_socket(port, self._jrdd_deserializer))
778

/home/ubuntu/anaconda3/lib/python3.5/site-packages/py4j/java_gateway.py in __call__(self, *args)

   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/home/ubuntu/anaconda3/lib/python3.5/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)

    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: Incomplete HDFS URI, no host: hdfs:/home/ubuntu/artist_stuff/_artist_data
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
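The "no host" part of the message can be illustrated by splitting the URIs into their components. This is a rough sketch only: Hadoop parses paths with Java's URI class, and Python's `urllib.parse.urlsplit` is used here as an approximation of the same generic URI rules. The host `namenode.example.com` and port `8020` are placeholders chosen for illustration, not values from this cluster, and `describe_uri` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into (scheme, authority, path) using generic URI rules."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


candidates = [
    # Three slashes: the authority (host:port) between "//" and the next "/" is empty.
    "hdfs:///home/ubuntu/artist_stuff/_artist_data",
    # Four slashes: the authority is still empty; the extra slash lands in the path.
    "hdfs:////home/ubuntu/artist_stuff/_artist_data",
    # Two slashes followed by host:port (placeholder values): the authority is present.
    "hdfs://namenode.example.com:8020/home/ubuntu/artist_stuff/_artist_data",
]

for uri in candidates:
    scheme, authority, path = describe_uri(uri)
    status = "no host" if not authority else "host = " + authority
    print(f"{uri}\n    scheme={scheme!r} path={path!r} -> {status}")
```

Under these rules, anything after a third slash is treated as part of the path, so a host and port placed after hdfs:/// would not be read as the authority either. That matches the URI the error reports, hdfs:/home/ubuntu/artist_stuff/_artist_data, which has no host component.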

java.io.IOException: Incomplete HDFS URI, no host: on AWS

0 answers: