What am I missing in my local pyspark?

Time: 2016-12-29 17:03:21

Tags: apache-spark pyspark

I have just started learning pyspark, and this looks like a showstopper: I tried to load a local text file into Spark:

base_df = sqlContext.read.text("/root/Downloads/SogouQ1.txt")
  

16/12/29 11:55:20 INFO text.TextRelation: Listing hdfs://localhost:9000/root/Downloads/SogouQ1.txt on driver

base_df.show(10)
  

16/12/29 11:55:36 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 61.8 KB, free 78.0 KB)
16/12/29 11:55:36 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 19.6 KB, free 97.6 KB)
16/12/29 11:55:36 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:35556 (size: 19.6 KB, free: 511.1 MB)
16/12/29 11:55:36 INFO spark.SparkContext: Created broadcast 2 from showString at NativeMethodAccessorImpl.java:-2
16/12/29 11:55:36 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 212.1 KB, free 309.7 KB)
16/12/29 11:55:36 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 19.6 KB, free 329.2 KB)
16/12/29 11:55:36 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:35556 (size: 19.6 KB, free: 511.1 MB)
16/12/29 11:55:36 INFO spark.SparkContext: Created broadcast 3 from showString at NativeMethodAccessorImpl.java:-2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 257, in show
    print(self._jdf.showString(n, truncate))
  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o34.showString.
: java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
        at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
        at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
        at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
        at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
        at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
        at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
        at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
        at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
        at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
        at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
        at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
        at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
        at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

I apologize for the messy error message shown above; I don't know how to format it nicely on Stack Overflow.

It works when I do this instead:

wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat',)],['word'])
wordsDF.show()
+--------+
|    word|
+--------+
|     cat|
|elephant|
|     rat|
|     rat|
|     cat|
+--------+

Thanks a lot.

1 answer:

Answer 0 (score: 1):

Thanks to @user6910411; the link he provided is the answer to my question:

base_df = sqlContext.read.text("file:///root/Downloads/SogouQ1.txt")