Spark Hadoop -> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist

Date: 2016-02-26 00:44:09

Tags: hadoop apache-spark

I am getting an error when trying to read a file from HDFS into Spark. The file README.md is present in HDFS:

[spark@osboxes hadoop]$ hdfs dfs -ls README.md
16/02/26 00:29:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 spark supergroup       4811 2016-02-25 23:38 README.md

In the Spark shell, I ran:

scala> val readme = sc.textFile("hdfs://localhost:9000/README.md")
readme: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:27

scala> readme.count
16/02/26 00:25:26 DEBUG BlockManager: Getting local block broadcast_4
16/02/26 00:25:26 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(true, true, false, true, 1)
16/02/26 00:25:26 DEBUG BlockManager: Getting block broadcast_4 from memory
16/02/26 00:25:26 DEBUG HadoopRDD: Creating new JobConf and caching it for later re-use
16/02/26 00:25:26 DEBUG Client: The ping interval is 60000 ms.
16/02/26 00:25:26 DEBUG Client: Connecting to localhost/127.0.0.1:9000
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: starting, having connections 1
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark sending #4
16/02/26 00:25:26 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark got value #4
16/02/26 00:25:26 DEBUG ProtobufRpcEngine: Call: getFileInfo took 6ms
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://localhost:9000/README.md
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
        at org.apache.spark.rdd.RDD.count(RDD.scala:1143)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
        at $iwC$$iwC$$iwC.<init>(<console>:43)
        at $iwC$$iwC.<init>(<console>:45)
        at $iwC.<init>(<console>:47)
        at <init>(<console>:49)
        at .<init>(<console>:53)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
        at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


scala> 16/02/26 00:25:36 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: closed
16/02/26 00:25:36 DEBUG Client: IPC Client (648679508) connection to localhost/127.0.0.1:9000 from spark: stopped, remaining connections 0

In core-site.xml, I have the following entry:

<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
</configuration>

And hdfs-site.xml contains the following:

<configuration>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
</configuration>

Am I missing something here? My OS is CentOS Linux release 7.2.1511 (Core), Hadoop is 2.7.2, and Spark is 1.6.0-bin-hadoop2.6.

5 Answers:

Answer 0 (score: 5)

This happens because of the internal mapping between directories. First, go to the directory where the file (README.md) is kept and run:

    df -k .

You will get the actual mount point of the directory, for example /xyz. Now look for the file (README.md) under that mount point, for example /xyz/home/omi/myDir/README.md, and use that path in your code:

    val readme = sc.textFile("/xyz/home/omi/myDir/README.md")
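For illustration, the lookup might go like this (the mount point and the df output below are hypothetical):

    [omi@osboxes myDir]$ df -k .
    Filesystem     1K-blocks    Used Available Use% Mounted on
    /dev/sda1       51474912 8123456  40731456  17% /xyz
    [omi@osboxes myDir]$ ls /xyz/home/omi/myDir/README.md
    /xyz/home/omi/myDir/README.md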

Answer 1 (score: 2)

By default, hdfs dfs -ls lists your user home folder on HDFS, not the HDFS root. You can easily verify this by comparing the output of hdfs dfs -ls with that of hdfs dfs -ls /. When you use the full hdfs URL, you are using an absolute path, and it does not find your file (because it is in your user home folder). When you use a relative path, the problem goes away :)

You may also want to know that hdfs dfs -put likewise uses your HDFS home folder as the default destination for files, not the HDFS root.
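Concretely, either of the following should work; a minimal sketch, assuming the default HDFS home layout of /user/&lt;username&gt; (here /user/spark):

    scala> // absolute path, spelling out the home folder explicitly
    scala> val readme = sc.textFile("hdfs://localhost:9000/user/spark/README.md")
    scala> // or a relative path, resolved against the HDFS home folder
    scala> val readme = sc.textFile("README.md")
    scala> readme.count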

Answer 2 (score: 0)

In my case, the README.md file was in the Spark folder (spark-2.4.3-bin-hadoop2.7) in my home directory.

So the full path was /home/sdayneko/spark-2.4.3-bin-hadoop2.7/README.md

I put this path into the input variable:

val input = sc.textFile("/home/sdayneko/spark-2.4.3-bin-hadoop2.7/README.md")

After that, it worked :)

Answer 3 (score: 0)

I have faced this problem before and found that you can get this issue if the table is corrupted:

    show partitions myschema.mytable;

Result:

    partitionkey=abc
    partitionkey=xyz

If you then do an ls on HDFS against the table folder:

    ls -ltr hdfs://servername/data/fid/work/hive/myschema/mytable
    partitionkey=abc

you will only get partition folders that do not match the partitions listed for the table.

When reading the table in Spark... you will hit this issue:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist

You will have to either drop the partitions or run MSCK REPAIR TABLE to resolve the issue. Thanks and regards, Kamleshkumar Gujarathi
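For illustration, a minimal sketch of the two fixes, assuming a Hive-enabled context named hiveContext (the same statements can also be run directly in the Hive shell):

    scala> // re-sync the metastore with the partition folders actually on HDFS
    scala> hiveContext.sql("MSCK REPAIR TABLE myschema.mytable")
    scala> // or drop the partition whose folder is missing from HDFS
    scala> hiveContext.sql("ALTER TABLE myschema.mytable DROP PARTITION (partitionkey='xyz')")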

Answer 4 (score: -1)

You can try changing the command to the following and running it again:

val readme = sc.textFile("./README.md")