How to compute the average of the numbers in a text input file?

Time: 2018-06-02 11:24:15

Tags: apache-spark

I have an input file that looks like this:

2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62
4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64
6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66
8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 68
10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70

How do I find the average of all these numbers in Spark? So far this is all the code I have been able to write:

val x1 = input.map( (value:String)=> value.split(" ") )

(input is the HDFS location of the input text file that contains all the numbers)

3 answers:

Answer 0 (score: 1)

You can write a solution with either Spark SQL's Dataset API or Spark Core's RDD API. I strongly recommend Spark SQL.

Let's assume the following lines dataset.

val lines = spark.read.text("input.txt").toDF("line")
scala> lines.show(truncate = false)
+--------------------------------------------------------------+
|line                                                          |
+--------------------------------------------------------------+
|2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62   |
|4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64  |
|6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66  |
|8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 |
|10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70|
+--------------------------------------------------------------+

(You called the dataset above input, but lines makes more sense here; sorry for the confusion.)

With that, you simply split every line into "numbers", which at this point are still string literals.

val numArrays = lines.withColumn("nums", split($"line", "\\s+"))
scala> numArrays.printSchema
root
 |-- line: string (nullable = true)
 |-- nums: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> numArrays.select("nums").show(truncate = false)
+------------------------------------------------------------------------------------+
|nums                                                                                |
+------------------------------------------------------------------------------------+
|[2, 5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50, 53, 56, 59, 62]   |
|[4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58, 61, 64]  |
|[6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66]  |
|[8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50, 53, 56, 59, 62, 65, 68] |
|[10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58, 61, 64, 67, 70]|
+------------------------------------------------------------------------------------+

One Spark idiom for computing something from an array is to explode it first, followed by a groupBy. That may not be the most efficient solution, but that depends on whether the lines are unique (I assume they are) and on the actual size of the dataset.

val ns = numArrays.withColumn("n", explode($"nums"))
scala> ns.show
+--------------------+--------------------+---+
|                line|                nums|  n|
+--------------------+--------------------+---+
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...|  2|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...|  5|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...|  8|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 11|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 14|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 17|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 20|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 23|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 26|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 29|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 32|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 35|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 38|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 41|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 44|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 47|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 50|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 53|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 56|
|2 5 8 11 14 17 20...|[2, 5, 8, 11, 14,...| 59|
+--------------------+--------------------+---+
only showing top 20 rows

With ns, computing the average of the numbers is a breeze.

val avgs = ns.groupBy("line").agg(avg($"n") as "avg")
scala> avgs.show(truncate = false)
+--------------------------------------------------------------+----+
|line                                                          |avg |
+--------------------------------------------------------------+----+
|10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70|40.0|
|2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62   |32.0|
|6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66  |36.0|
|8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 62 65 68 |38.0|
|4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64  |34.0|
+--------------------------------------------------------------+----+
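(If what you are after is a single average across all the numbers rather than one per line, which is how I read the question, a minimal variant is to skip the groupBy and aggregate the exploded column directly.)

// single average over every number in the file; for this input it works out to 36.0
ns.agg(avg($"n") as "avg").show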

Another solution could be to use a user-defined function and compute the average directly over the array. I would not be surprised if the user-defined function outperformed the solution above.
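For completeness, a minimal sketch of that idea; the UDF name arrayAvg is mine, and it assumes the numArrays dataset and the spark-shell implicits from above:

import org.apache.spark.sql.functions.udf

// hypothetical UDF: parse each row's string array and average it in one pass
val arrayAvg = udf { nums: Seq[String] => nums.map(_.toDouble).sum / nums.size }

val avgsViaUdf = numArrays.select($"line", arrayAvg($"nums") as "avg")
avgsViaUdf.show(truncate = false)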

Answer 1 (score: 0)

Solution 1

val input = spark.sparkContext.textFile("file:///D:/Fast-Nu/input.txt")  // local path; an HDFS path works here too

val x1 = input.flatMap(_.split("\\s"))  // _.split("\\s") is the same as (x => x.split("\\s"))
val x2 = x1.map(_.toInt)                // _.toInt is the same as x => x.toInt

// aggregate a (sum, count) pair: the first function adds a value into a partition's
// accumulator, the second merges the accumulators of two partitions
val agg = x2.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))

val average = agg._1 / agg._2.toDouble
println(average)

Solution 2

val input = spark.sparkContext.textFile("file:///D:/Fast-Nu/input.txt")  // local path; an HDFS path works here too

val x1 = input.flatMap(_.split("\\s"))
val x2 = x1.map(_.toInt)

// mean is a built-in action on RDDs of numeric values
val avg = x2.mean
println(avg)

Solution 3

val input = spark.sparkContext.textFile("file:///D:/Fast-Nu/input.txt")  // local path; an HDFS path works here too

val x1 = input.flatMap(_.split("\\s"))
val x2 = x1.map(_.toInt)

// build a (sum, count) pair with map + reduce, then divide
val x3 = x2.map(x => (x, 1)).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
val avg = x3._1.toDouble / x3._2.toDouble

println(avg)

Answer 2 (score: -1)

Here's a simpler one, assuming all the records sit on one line, separated by spaces. If the numbers are on separate lines instead, you can change it accordingly.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", " ")
val input = sc.newAPIHadoopFile("file_path", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val inputRdd = input.map { case (_, text) => text.toString.toLong}

This creates an RDD with each number as an element. Next,

val tup = inputRdd
      .map((_, 1L))
      .reduce(reducer)

val avg = tup._1.toDouble / tup._2   // divide as Double so the average is not truncated

where reducer is:

def reducer(a: (Long, Long), b: (Long, Long)): (Long, Long) = (a._1 + b._1, a._2 + b._2)

avg is your result.

Hope this helps, cheers.