RDD不起作用

时间:2016-12-18 16:52:58

标签: apache-spark pycharm pyspark

我'我目前正在开展一个项目,似乎无法克服火花中的错误。 像.first()和.collect()这样的函数不会给出结果。 这是我的代码:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="C:\spark-2.0.1-bin-hadoop2.7"

# Append pyspark  to Python Path
sys.path.append("C:\spark-2.0.1-bin-hadoop2.7\python ")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print ("Successfully imported Spark Modules")

except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)

import re

sc = SparkContext()
file = sc.textFile('rC:\\essay.txt')

word = file.map(lambda line: re.split(r'[?:\n|\s]\s*', line))

word.first() 

当我在pycharm上运行它时。它生成以下内容:

Successfully imported Spark Modules

16/12/18 17:23:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/18 17:23:43 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
Traceback (most recent call last):
  File "C:/Users/User1/PycharmProjects/BigData/SparkMatrice.py", line 43, in <module>
    word.first()
  File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1328, in first
  File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 1280, in take
  File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\rdd.py", line 2388, in getNumPartitions
  File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py", line 1133, in __call__
  File "C:\spark-2.0.1-bin-hadoop2.7\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o19.partitions.
: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: rC:%5Cessay.txt
    at org.apache.hadoop.fs.Path.initialize(Path.java:205)
    at org.apache.hadoop.fs.Path.<init>(Path.java:171)
    at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:245)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$29.apply(SparkContext.scala:992)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:60)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: rC:%5Cessay.txt
    at java.net.URI.checkPath(Unknown Source)
    at java.net.URI.<init>(Unknown Source)
    at org.apache.hadoop.fs.Path.initialize(Path.java:202)
    ... 32 more 

当我用.collect()替换.first()时会发生同样的事情。(当我使用终端而不是pycharm时会发生同样的事情)。 我希望有人可以帮我找出问题所在。

1 个答案:

答案 0 :(得分:2)

问题列在那里,你的路径错了:

  

引起:java.net。 URISyntaxException :绝对URI中的相对路径: rC:%5Cessay.txt       在java.net.URI。 checkPath (未知来源)

您需要更改

file = sc.textFile('rC:\\essay.txt')

file = sc.textFile(r'C:\\essay.txt')