Question

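I'm new to Spark and I'm using Scala. I'm having some trouble traversing an RDD of key-value pairs. I have a TSV file, file1, which contains, among other things, Country Name, Latitude and Longitude. So far I have: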

val a = file1.map(_.split("\t")).map(rec => (rec(1), (rec(11).toDouble, rec(12).toDouble)))

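rec(1) is the country name, rec(11) is the longitude, and rec(12) is the latitude. As far as I understand, a is now an RDD of key-value pairs, with rec(1) as the key and rec(11) and rec(12) as the values. I have managed to test that a.first._1 gives the first key, a.first._2._1 gives the first value of that key, and a.first._2._2 gives the second value of that key.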

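My goal is to at least get the average of all rec(11) values that share the same key, and to do the same for rec(12). My idea is to add them all up and then divide by the number of key-value pairs for that key.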

Could someone help me with what I should do next? I tried using map, flatValueMap, valueMap, groupByKey and so on, but I can't seem to sum up the rec(11)'s and the rec(12)'s at the same time.

Answer 1

You can convert the data to a DataFrame, group it with groupBy, and then use agg to perform the avg operation.

Here is a simple example:

Original DF:

+------------+-----+
|country code|pairs|
+------------+-----+
|          ES|[1,2]|
|          UK|[2,3]|
|          ES|[4,5]|
+------------+-----+

Performing the operation:

df.groupBy($"country code").agg(avg($"pairs._1"), avg($"pairs._2"))

Result:

+------------+-------------+-------------+
|country code|avg(pairs._1)|avg(pairs._2)|
+------------+-------------+-------------+
|          ES|          2.5|          3.5|
|          UK|          2.0|          3.0|
+------------+-------------+-------------+
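To connect this with the RDD from the question, here is a minimal sketch of how an RDD shaped like a could be turned into such a DataFrame before grouping. It assumes Spark SQL is on the classpath and a local SparkSession; the object name AvgByCountryDF, the helper averageByCountry, and the sample rows are hypothetical and only mirror the table above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

// Hypothetical wrapper object, used only for illustration.
object AvgByCountryDF {

  // Converts an RDD of (key, (v1, v2)) into a DataFrame with a struct column
  // "pairs" (fields _1 and _2), then averages each field per key.
  def averageByCountry(spark: SparkSession,
                       a: RDD[(String, (Double, Double))]): DataFrame = {
    import spark.implicits._ // enables toDF and the $"..." column syntax
    a.toDF("country code", "pairs")
      .groupBy($"country code")
      .agg(avg($"pairs._1"), avg($"pairs._2"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this small sample.
    val spark = SparkSession.builder()
      .appName("avg-by-country")
      .master("local[*]")
      .getOrCreate()

    // Sample rows matching the "Original DF" table above.
    val a = spark.sparkContext.parallelize(Seq(
      ("ES", (1.0, 2.0)),
      ("UK", (2.0, 3.0)),
      ("ES", (4.0, 5.0))
    ))

    averageByCountry(spark, a).show()
    spark.stop()
  }
}
```

In the question's setting, the RDD a built from file1 would be passed in place of the sample rows.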

Answer 2

"My goal is to at least get the average of all rec(11) values with the same key, and the same for rec(12)"

You can do it as follows (commented for clarity):

a.mapValues(x => (x, 1))    // attach a counter to each value: (k, (v1, v2)) becomes (k, ((v1, v2), 1))
  .reduceByKey{case(x,y) => ((x._1._1+y._1._1, x._1._2+y._1._2), x._2+y._2)}  // per key, sum all v1, all v2, and the counters separately
  .map{case(x, y)=> (x, (y._1._1/y._2, y._1._2/y._2))}  // compute the averages by dividing the sum of v1 and the sum of v2 by the counter sum

This is also explained at https://stackoverflow.com/a/49166009/5880706
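To trace what each step produces, the same sum-and-count logic can be run on a plain Scala collection without Spark. This is only a sketch: the object AvgByKeyLocal, the function averageByKey, and the sample data are hypothetical, and the local groupBy stands in for the grouping that reduceByKey performs per key on the cluster.

```scala
// Hypothetical local stand-in for the RDD pipeline above (no Spark required).
object AvgByKeyLocal {

  def averageByKey(pairs: Seq[(String, (Double, Double))]): Map[String, (Double, Double)] =
    pairs
      .map { case (k, v) => (k, (v, 1)) }   // same as mapValues(x => (x, 1))
      .groupBy(_._1)                        // collect all entries that share a key
      .map { case (k, group) =>
        // Same combining function as in reduceByKey: sum v1, v2 and the counter.
        val ((sum1, sum2), n) = group.map(_._2).reduce { (x, y) =>
          ((x._1._1 + y._1._1, x._1._2 + y._1._2), x._2 + y._2)
        }
        // Divide each sum by the number of entries for this key.
        k -> (sum1 / n, sum2 / n)
      }

  def main(args: Array[String]): Unit = {
    val sample = Seq(("ES", (1.0, 2.0)), ("UK", (2.0, 3.0)), ("ES", (4.0, 5.0)))
    averageByKey(sample).foreach(println)
  }
}
```

For the sample data, ES averages to (2.5, 3.5) and UK to (2.0, 3.0), which matches the table in the first answer.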
