How to read a file in PySpark on AWS

Time: 2018-03-16 00:51:44

Tags: python apache-spark amazon-ec2 pyspark

I'm new to Spark. I'm trying to read a file from my master instance, but I get the error below. After some research I found that the data has to be loaded into HDFS or copied to every node of the cluster, but I can't find the commands to do either (a sketch of both options follows the traceback).

  

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-…> in <module>()
----> 1 ncols = rdd.first().features.size  # number of columns (without the class) of the dataset

/home/ec2-user/spark/python/pyspark/rdd.pyc in first(self)
   1359             ValueError: RDD is empty
   1360         """
-> 1361         rs = self.take(1)
   1362         if rs:
   1363             return rs[0]

/home/ec2-user/spark/python/pyspark/rdd.pyc in take(self, num)
   1311         """
   1312         items = []
-> 1313         totalParts = self.getNumPartitions()
   1314         partsScanned = 0
   1315

/home/ec2-user/spark/python/pyspark/rdd.pyc in getNumPartitions(self)
   2438
   2439     def getNumPartitions(self):
-> 2440         return self._prev_jrdd.partitions().size()
   2441
   2442     @property

/home/ec2-user/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/home/ec2-user/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
    61     def deco(*a, **kw):
    62         try:
---> 63             return f(*a, **kw)
    64         except py4j.protocol.Py4JJavaError as e:
    65             s = e.java_exception.toString()

/home/ec2-user/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o122.partitions.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ec2-user/PR_DATA_35.csv
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
	at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
	at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
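For reference, a minimal sketch of both options, run from the pyspark shell on the master. The file path is the one from the traceback; the HDFS target path and the worker hostnames are placeholders, not from the original post:

    # Option 1: put the file into HDFS from the master's shell, then read it
    # back with an hdfs:// path instead of a local file: path.
    #   hadoop fs -put /home/ec2-user/PR_DATA_35.csv /PR_DATA_35.csv
    rdd = sc.textFile("hdfs:///PR_DATA_35.csv")  # `sc` is the shell's SparkContext

    # Option 2: copy the file to the same local path on every worker first
    # (worker1/worker2 are hypothetical hostnames); a file:// path then works
    # because each executor can open that path locally.
    #   for h in worker1 worker2; do scp /home/ec2-user/PR_DATA_35.csv $h:/home/ec2-user/; done
    rdd = sc.textFile("file:///home/ec2-user/PR_DATA_35.csv")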

1 Answer:

Answer 0: (score: 0)

Since you're already in AWS, it's probably easier to store your data files in S3 and open them directly from there.
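A minimal sketch of that approach, assuming the CSV has been uploaded to a bucket (my-bucket below is a placeholder) and that the hadoop-aws and AWS SDK jars are on the classpath so Spark understands the s3a:// scheme:

    # One-time upload from the master's shell (bucket name is hypothetical):
    #   aws s3 cp /home/ec2-user/PR_DATA_35.csv s3://my-bucket/PR_DATA_35.csv
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-from-s3").getOrCreate()

    # If the EC2 instances have no IAM role granting S3 access,
    # credentials can be supplied explicitly instead:
    # spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", "...")
    # spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "...")

    df = spark.read.csv("s3a://my-bucket/PR_DATA_35.csv", header=True, inferSchema=True)
    rdd = df.rdd  # drop down to an RDD if needed, as in the question

Each executor then reads its split straight from S3, so nothing has to be copied onto the cluster beforehand.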