Unable to display Scala word count

Posted: 2018-12-04 16:51:24

Tags: scala apache-spark

I am trying to write a Scala program that counts the words in a txt file and prints the final count (on Cloudera, using Spark).

import scala.io.Codec.string2codec
import scala.io.Source
import scala.reflect.io.File
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleWordCount {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Simple Word Count")
    val sc = new SparkContext(conf)

The program is picking the file up from the right location, because when I point it at a wrong location it reports a file error.

    scala.io.Source.fromFile("/home/cloudera/Books/book1.txt")
      .getLines
      .flatMap(_.split("\\W+"))
      .foldLeft(Map.empty[String, Int]) {
        (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))

I have tried different ways of printing the running count at this point, but I get errors such as:

 System.Out.Println(count)
 [error] /home/cloudera/src/main/scala/SimpleWordCount.scala:19:21: type mismatch;
 [error]  found   : Unit
 [error]  required: scala.collection.immutable.Map[String,Int]

 System.out.println(word,count)
 type mismatch;
 [error]  found   : Unit
 [error]  required: scala.collection.immutable.Map[String,Int]
 [error]   System.out.println(word,count)


  }

I added the following line to check whether the program runs at all:

  System.out.println("This is working over here !!!!!!!!!#$%%E^$%^%%$%#$^%")

  }

 }

When I run the code, it produces the following output:

[cloudera@quickstart ~]$ spark-submit --master=local[*] --class=SimpleWordCount /home/cloudera/target/scala-2.10/wordcount_2.10-1.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.12.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]  
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/12/04 08:13:09 INFO spark.SparkContext: Running Spark version 1.6.0
18/12/04 08:13:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/12/04 08:13:10 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 192.168.182.129 instead (on interface eth3)
18/12/04 08:13:10 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/12/04 08:13:10 INFO spark.SecurityManager: Changing view acls to: cloudera
18/12/04 08:13:10 INFO spark.SecurityManager: Changing modify acls to: cloudera
18/12/04 08:13:10 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
18/12/04 08:13:10 INFO util.Utils: Successfully started service 'sparkDriver' on port 34679.
18/12/04 08:13:11 INFO slf4j.Slf4jLogger: Slf4jLogger started
18/12/04 08:13:11 INFO Remoting: Starting remoting
18/12/04 08:13:11 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.182.129:47272]
18/12/04 08:13:11 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriverActorSystem@192.168.182.129:47272]
18/12/04 08:13:11 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 47272.
18/12/04 08:13:11 INFO spark.SparkEnv: Registering MapOutputTracker
18/12/04 08:13:11 INFO spark.SparkEnv: Registering BlockManagerMaster
18/12/04 08:13:11 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-1f139fce-3b13-4c07-bed6-7b35f82ccc6a
18/12/04 08:13:11 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
18/12/04 08:13:11 INFO spark.SparkEnv: Registering OutputCommitCoordinator
18/12/04 08:13:11 INFO server.Server: jetty-8.y.z-SNAPSHOT
18/12/04 08:13:11 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
18/12/04 08:13:11 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
18/12/04 08:13:11 INFO ui.SparkUI: Started SparkUI at http://192.168.182.129:4040
18/12/04 08:13:11 INFO spark.SparkContext: Added JAR file:/home/cloudera/target/scala-2.10/wordcount_2.10-1.0.jar at spark://192.168.182.129:34679/jars/wordcount_2.10-1.0.jar with timestamp 1543939991769
18/12/04 08:13:11 INFO executor.Executor: Starting executor ID driver on host localhost
18/12/04 08:13:11 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58495.
18/12/04 08:13:11 INFO netty.NettyBlockTransferService: Server created on 58495
18/12/04 08:13:11 INFO storage.BlockManagerMaster: Trying to register BlockManager
18/12/04 08:13:11 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:58495 with 530.3 MB RAM, BlockManagerId(driver, localhost, 58495)
18/12/04 08:13:11 INFO storage.BlockManagerMaster: Registered BlockManager

The println command does seem to work here:

This is working over here !!!!!!!!!#$%%E^$%^%%$%#$^%
18/12/04 08:13:12 INFO spark.SparkContext: Invoking stop() from shutdown hook
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null} 
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
18/12/04 08:13:12 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
18/12/04 08:13:12 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.182.129:4040
18/12/04 08:13:12 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/12/04 08:13:12 INFO storage.MemoryStore: MemoryStore cleared
18/12/04 08:13:12 INFO storage.BlockManager: BlockManager stopped
18/12/04 08:13:12 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
18/12/04 08:13:12 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/12/04 08:13:12 INFO spark.SparkContext: Successfully stopped SparkContext
18/12/04 08:13:12 INFO util.ShutdownHookManager: Shutdown hook called
18/12/04 08:13:12 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
18/12/04 08:13:12 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e468a57d-10e2-472e-98f2-f3701f6a4b1a

1 Answer:

Answer 0 (score: 1)

Maybe this can help you?
As I said, the problem is that you cannot make the print the return value inside the foldLeft; you can only add a print statement and then return the updated map.

val lines = List(
  "Hello, World!",
  "Goodbye, World!",
  "Hello, Hadoop!"
)

// Fold over every word, threading the running count map through the fold.
// The println is only a side effect; the updated map is the last expression
// of the lambda, so it is what each fold step returns.
val wordCount =
  lines
    .flatMap(_.split("\\W+"))
    .foldLeft(Map.empty[String, Int]) {
      (count, word) =>
        println(s"DEBUG count: $count for word: '$word'.")
        count + (word -> (count.getOrElse(word, 0) + 1))
    }

// Render the final map, one "word -> count" entry per line.
val formattedWordCount =
  wordCount
    .map(tuple => s"${tuple._1} -> ${tuple._2}")
    .mkString("\n", "\n", "\n")
println(s"Final Word Count: $formattedWordCount")

Output

DEBUG count: Map() for word: 'Hello'.
DEBUG count: Map(Hello -> 1) for word: 'World'.
DEBUG count: Map(Hello -> 1, World -> 1) for word: 'Goodbye'.
DEBUG count: Map(Hello -> 1, World -> 1, Goodbye -> 1) for word: 'World'.
DEBUG count: Map(Hello -> 1, World -> 2, Goodbye -> 1) for word: 'Hello'.
DEBUG count: Map(Hello -> 2, World -> 2, Goodbye -> 1) for word: 'Hadoop'.
Final Word Count:
Hello -> 2
World -> 2
Goodbye -> 1
Hadoop -> 1
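
To connect this back to the question: below is a minimal sketch (untested on the asker's setup) of how SimpleWordCount could apply the same fix, keeping the file path and Spark boilerplate from the question. Note that, like the original, it reads the file with scala.io.Source rather than through an RDD, so Spark itself is not doing the counting.

import scala.io.Source

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SimpleWordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Word Count")
    val sc = new SparkContext(conf)

    // Build the whole count map first; nothing is printed inside the fold,
    // so the lambda's last expression is always the updated map.
    val wordCount =
      Source.fromFile("/home/cloudera/Books/book1.txt")
        .getLines
        .flatMap(_.split("\\W+"))
        .foldLeft(Map.empty[String, Int]) {
          (count, word) => count + (word -> (count.getOrElse(word, 0) + 1))
        }

    // Print the finished map only after the fold has returned it.
    wordCount.foreach { case (word, n) => println(s"$word -> $n") }

    sc.stop()
  }
}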