I have installed Cloudera CDH 5 using Cloudera Manager.
I can easily run
hadoop fs -ls /input/war-and-peace.txt
hadoop fs -cat /input/war-and-peace.txt
The commands above print the whole text file to the console.
Now I start the spark shell and run
val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
textFile.count
Now I get this error:
Spark context available as sc.
scala> val textFile = sc.textFile("hdfs://input/war-and-peace.txt")
2014-12-14 15:14:57,874 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - ensureFreeSpace(177621) called with curMem=0, maxMem=278302556
2014-12-14 15:14:57,877 INFO [main] storage.MemoryStore (Logging.scala:logInfo(59)) - Block broadcast_0 stored as values in memory (estimated size 173.5 KB, free 265.2 MB)
textFile: org.apache.spark.rdd.RDD[String] = hdfs://input/war-and-peace.txt MappedRDD[1] at textFile at <console>:12
scala> textFile.count
2014-12-14 15:15:21,791 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 0 time(s); maxRetries=45
2014-12-14 15:15:41,905 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 1 time(s); maxRetries=45
2014-12-14 15:16:01,925 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 2 time(s); maxRetries=45
2014-12-14 15:16:21,983 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 3 time(s); maxRetries=45
2014-12-14 15:16:42,001 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 4 time(s); maxRetries=45
2014-12-14 15:17:02,062 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 5 time(s); maxRetries=45
2014-12-14 15:17:22,082 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 6 time(s); maxRetries=45
2014-12-14 15:17:42,116 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 7 time(s); maxRetries=45
2014-12-14 15:18:02,138 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 8 time(s); maxRetries=45
2014-12-14 15:18:22,298 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 9 time(s); maxRetries=45
2014-12-14 15:18:42,319 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 10 time(s); maxRetries=45
2014-12-14 15:19:02,354 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 11 time(s); maxRetries=45
2014-12-14 15:19:22,373 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 12 time(s); maxRetries=45
2014-12-14 15:19:42,424 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 13 time(s); maxRetries=45
2014-12-14 15:20:02,446 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 14 time(s); maxRetries=45
2014-12-14 15:20:22,512 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 15 time(s); maxRetries=45
2014-12-14 15:20:42,515 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 16 time(s); maxRetries=45
2014-12-14 15:21:02,550 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 17 time(s); maxRetries=45
2014-12-14 15:21:22,558 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 18 time(s); maxRetries=45
2014-12-14 15:21:42,683 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 19 time(s); maxRetries=45
2014-12-14 15:22:02,702 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 20 time(s); maxRetries=45
2014-12-14 15:22:22,832 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 21 time(s); maxRetries=45
2014-12-14 15:22:42,852 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 22 time(s); maxRetries=45
2014-12-14 15:23:02,974 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 23 time(s); maxRetries=45
2014-12-14 15:23:22,995 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 24 time(s); maxRetries=45
2014-12-14 15:23:43,109 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 25 time(s); maxRetries=45
2014-12-14 15:24:03,128 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 26 time(s); maxRetries=45
2014-12-14 15:24:23,250 INFO [main] ipc.Client (Client.java:handleConnectionTimeout(814)) - Retrying connect to server: input/92.242.140.21:8020. Already tried 27 time(s); maxRetries=45
java.net.ConnectException: Call From dn1home/192.168.1.21 to input:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
Why do I get this error? I can read the same file just fine with the hadoop commands.
Answer 0 (score: 54)
Here is the solution:
sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
How did I find nn1home:8020?
Just open the file core-site.xml and look for the xml element fs.defaultFS.
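For reference, the relevant core-site.xml entry typically looks like the sketch below (nn1home and 8020 are just this example's values; yours may differ):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nn1home:8020</value>
</property>

Whatever value you find there is the prefix to put in front of your hdfs path.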
Answer 1 (score: 7)
If you want to use sc.textFile("hdfs://...") you need to give the full (absolute) path, which in your example would be "nn1home:8020/..".
If you want to keep it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt") with only one /.
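As a small spark-shell sketch of the two forms (assuming your fs.defaultFS is hdfs://nn1home:8020, as in the answer above), both calls should resolve to the same file:

val full = sc.textFile("hdfs://nn1home:8020/input/war-and-peace.txt")
val short = sc.textFile("hdfs:/input/war-and-peace.txt") // single slash: path is resolved against fs.defaultFS
full.count == short.count // should be true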
Answer 2 (score: 6)
This will work:
val textFile = sc.textFile("hdfs://localhost:9000/user/input.txt")
Here, localhost:9000 comes from the value of the fs.defaultFS parameter in the hadoop configuration file core-site.xml.
Answer 3 (score: 2)
You are not passing the correct URL string.
hdfs:// - the protocol type
localhost - the ip address (may be different for you, e.g. 127.56.78.4)
54310 - the port number
/input/war-and-peace.txt - the complete path to the file you want to load
Finally, the URL should look like this:
hdfs://localhost:54310/input/war-and-peace.txt
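Putting that together in the spark-shell (localhost and 54310 are this answer's assumptions; substitute the values from your own fs.defaultFS):

val textFile = sc.textFile("hdfs://localhost:54310/input/war-and-peace.txt")
textFile.count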
Answer 4 (score: 1)
I also use CDH5. For me the full path, i.e. "hdfs://nn1home:8020", did not work for some strange reason, even though most examples show the path like that.
I used a command like
val textFile=sc.textFile("hdfs:/input1/Card_History2016_3rdFloor.csv")
Output of the above command:
textFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:22
textFile.count
res1: Long = 58973
This works fine for me.
Answer 5 (score: 1)
This worked for me:
logFile = "hdfs://localhost:9000/sampledata/sample.txt"
Answer 6 (score: 1)
If you set HADOOP_HOME in spark-env.sh before starting Spark, Spark will know where to look for the hdfs configuration files.
In that case Spark already knows the location of your namenode/datanode, and the following alone works fine to access hdfs files:
sc.textFile("/myhdfsdirectory/myfiletoprocess.txt")
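As a minimal sketch, spark-env.sh might contain lines like the ones below (the paths are assumptions; point them at your own Hadoop installation, and note that many Spark versions read HADOOP_CONF_DIR rather than HADOOP_HOME for the config directory):

export HADOOP_HOME=/usr/lib/hadoop                 # assumed install location; adjust for your cluster
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop   # assumed directory holding core-site.xml and hdfs-site.xml

With those set before launching spark-shell, the scheme-less path above is resolved against the fs.defaultFS from core-site.xml.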
You can create your hdfs directory as follows:
hdfs dfs -mkdir /myhdfsdirectory
From your local file system you can move myfiletoprocess.txt into the hdfs directory with the following command:
hdfs dfs -copyFromLocal mylocalfile /myhdfsdirectory/myfiletoprocess.txt
Answer 7 (score: 1)
val conf = new SparkConf().setMaster("local[*]").setAppName("HDFSFileReader")
conf.set("fs.defaultFS", "hdfs://hostname:9000")
val sc = new SparkContext(conf)
val data = sc.textFile("hdfs://hostname:9000/hdfspath/")
data.saveAsTextFile("C:\\dummy\\")
The code above reads all hdfs files from the directory and saves them locally in the C:\dummy folder.
Answer 8 (score: 1)
It could also be a problem with the file path or URL and the hdfs port.
Solution:
First, open the core-site.xml file from the location $HADOOP_HOME/etc/hadoop and check the value of the property fs.defaultFS.
Let's say the value is hdfs://localhost:9000 and the file location in hdfs is /home/usr/abc/fileName.txt.
Then the file URL will be: hdfs://localhost:9000/home/usr/abc/fileName.txt
and the following command reads the file from hdfs:
sc.textFile("hdfs://localhost:9000/home/usr/abc/fileName.txt")
Answer 9 (score: 1)
Get the fs.defaultFS URL from core-site.xml (/etc/hadoop/conf) and read the file as below. In my case, fs.defaultFS is hdfs://quickstart.cloudera:8020.
txtfile = sc.textFile('hdfs://quickstart.cloudera:8020/user/cloudera/rddoutput')
txtfile.collect()