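I am trying to stream data from a TCP port and load it into HDFS using Spark Streaming.
The files are created in HDFS, but they are all empty, even though the Spark Streaming console shows that data is being read from the TCP port.
I tried this from the Scala shell on CDH 5 with Spark 0.9.0, 0.9.1, and 1.0. In another terminal I ran 'nc -lk 9993' to feed in the data.
The code is below. Please advise how to fix this. Thanks.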
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc8 = new StreamingContext("local", "NetworkWordCount", Seconds(1))
val lines8 = ssc8.socketTextStream("localhost", 9993)
val words8 = lines8.flatMap(_.split(" "))
val pairs8 = words8.map(word => (word, 1))
val wordCounts8 = pairs8.reduceByKey(_ + _)
wordCounts8.saveAsTextFiles("hdfs://Node1:8020/user/root/Spark8")
wordCounts8.print()
ssc8.start()
Additional information ---------------------------------------
The HDFS file listing and the console log are provided below.
HDFS Output Files
--------------------
-rw-r--r-- 3 user1 user1 0 2014-06-26 09:19 /user/user1/SparkV/_SUCCESS
-rw-r--r-- 3 user1 user1 0 2014-06-26 09:19 /user/user1/SparkV/part-00000
-rw-r--r-- 3 user1 user1 0 2014-06-26 09:19 /user/user1/SparkV/part-00001
Spark-Shell Console Log
---------------------
-------------------------------------------
Time: 1403789836000 ms
-------------------------------------------
(f,3)
(fsd,2)
(sdf,2)
(fds,1)
(sd,3)
14/06/26 09:37:16 INFO scheduler.JobScheduler: Finished job streaming job 1403789836000 ms.1 from job set of time 1403789836000 ms
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(8) called with curMem=327, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836000 stored as bytes to memory (size 8.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836000 in memory on localhost:49784 (size: 8.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836000
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836000 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836000
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(15) called with curMem=335, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836200 stored as bytes to memory (size 15.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836200 in memory on localhost:49784 (size: 15.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836200
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836200 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836200
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(8) called with curMem=350, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836400 stored as bytes to memory (size 8.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836400 in memory on localhost:49784 (size: 8.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836400
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836400 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836400
14/06/26 09:37:16 INFO storage.MemoryStore: ensureFreeSpace(9) called with curMem=358, maxMem=286339891
14/06/26 09:37:16 INFO storage.MemoryStore: Block input-0-1403789836600 stored as bytes to memory (size 9.0 B, free 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerInfo: Added input-0-1403789836600 in memory on localhost:49784 (size: 9.0 B, free: 273.1 MB)
14/06/26 09:37:16 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836600
14/06/26 09:37:16 WARN storage.BlockManager: Block input-0-1403789836600 already exists on this machine; not re-adding it
14/06/26 09:37:16 INFO receiver.BlockGenerator: Pushed block input-0-1403789836600
14/06/26 09:37:17 INFO storage.MemoryStore: ensureFreeSpace(14) called with curMem=367, maxMem=286339891
14/06/26 09:37:17 INFO storage.MemoryStore: Block input-0-1403789836800 stored as bytes to memory (size 14.0 B, free 273.1 MB)
14/06/26 09:37:17 INFO storage.BlockManagerInfo: Added input-0-1403789836800 in memory on localhost:49784 (size: 14.0 B, free: 273.1 MB)
14/06/26 09:37:17 INFO storage.BlockManagerMaster: Updated info of block input-0-1403789836800
14/06/26 09:37:17 WARN storage.BlockManager: Block input-0-1403789836800 already exists on this machine; not re-adding it
14/06/26 09:37:17 INFO receiver.BlockGenerator: Pushed block input-0-1403789836800
14/06/26 09:37:18 INFO scheduler.ReceiverTracker: Stream 0 received 6 blocks
14/06/26 09:37:18 INFO scheduler.JobScheduler: Added jobs for time 1403789838000 ms
Answer 0 (score: 2)
At first glance, my guess is that you should try local[4] instead of local, so that Spark can schedule more tasks.
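For context, the Spark Streaming guide warns that when running locally the master should be local[n] with n greater than the number of receivers, because each receiver occupies a thread. A minimal sketch of that change, reusing the names from the question (whether it fixes the empty files is not confirmed here):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[4]" runs Spark locally with 4 threads: the socket receiver
// occupies one of them and the rest remain free to process batches.
val ssc8 = new StreamingContext("local[4]", "NetworkWordCount", Seconds(1))
val lines8 = ssc8.socketTextStream("localhost", 9993)
```

The rest of the pipeline from the question stays the same.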
Answer 1 (score: 0)
wordCounts8.saveAsTextFiles("hdfs://Node1:8020/user/root/Spark8", "log")
========== or
wordCounts8.saveAsTextFiles("hdfs://Node1:8020/user/root/Spark8" + System.currentTimeMillis().toString())
========== This worked for me on Spark 1.3; see whether it works for you.
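It may help to know where these calls put their output. The API docs describe saveAsTextFiles as writing one directory per batch interval, named "prefix-TIME_IN_MS[.suffix]". The helper below is hypothetical (it is not part of Spark) and only illustrates that naming rule in plain Scala:

```scala
// Hypothetical helper, not a Spark API: mirrors the documented
// "prefix-TIME_IN_MS[.suffix]" naming of saveAsTextFiles output.
object BatchOutputName {
  def dirFor(prefix: String, batchTimeMs: Long, suffix: String = ""): String =
    if (suffix.isEmpty) s"$prefix-$batchTimeMs"
    else s"$prefix-$batchTimeMs.$suffix"

  def main(args: Array[String]): Unit = {
    val prefix = "hdfs://Node1:8020/user/root/Spark8"
    // Batch time taken from the console log in the question.
    println(dirFor(prefix, 1403789836000L))
    println(dirFor(prefix, 1403789836000L, "log"))
  }
}
```

So with a one-second batch interval, expect one new directory every second, each holding its own _SUCCESS and part-* files.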
Answer 2 (score: 0)
I ran into the same problem.
Try running
hadoop fs -cat hdfs://Node1:8020/user/root/Spark8
(The hadoop command may be different for you. I had to invoke it as /a/bin/hadoop, but that depends on your setup.)
and see whether it returns:
cat: `hdfs://Node1:8020/user/root/Spark8': Is a directory
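If it does, then, as you said in the comments, you should be able to see a _SUCCESS file and some part-* files in that directory.
At that point my own problem was solved, but you seem to have a further problem writing to HDFS.
As for why your files are still empty, I suggest switching to Spark 1.4.0, since you may get better results with CDH 5.4. Also, if you run into HDFS permission problems, you will have to run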
hadoop dfs -chmod -R 0777 /your_hdfs_folder
so that you have write access.
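One possible explanation for the empty part files (an assumption, not confirmed in the question) is that saveAsTextFiles writes output for every batch, including batches that received no data. With a one-second interval and input typed by hand into nc, most batches would be empty. A sketch that skips empty batches follows; saveNonEmptyBatches and batchDir are hypothetical names introduced for illustration:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: one output directory per batch.
def batchDir(prefix: String, batchTimeMs: Long): String = s"$prefix-$batchTimeMs"

// Hypothetical helper: save a batch only when it contains data.
def saveNonEmptyBatches(counts: DStream[(String, Int)], prefix: String): Unit =
  counts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
    // take(1) is an emptiness check that also works on old Spark
    // releases that predate RDD.isEmpty.
    if (rdd.take(1).nonEmpty) {
      rdd.saveAsTextFile(batchDir(prefix, time.milliseconds))
    }
  }
```

In the question's code this would replace the saveAsTextFiles line, for example saveNonEmptyBatches(wordCounts8, "hdfs://Node1:8020/user/root/Spark8"), and it has to be called before ssc8.start().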