How to take the head of a DataFrame with a Map[String, Long] column and preserve its type?

Date: 2018-12-12 08:22:38

Tags: scala apache-spark apache-spark-sql

I have a DataFrame to which I have applied a filter condition:

val colNames = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)

From all the selected rows, I only want the last column of a single row.

The last column is of type Map[String, Long]. I want all the keys of that map as a List[String].

I tried the following syntax:

val colNames = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
  .head
  .getMap(14)
  .keySet
  .toList
  .map(_.toString)

I am using map(_.toString) to convert the List[Nothing] into a List[String]. The error I get is:

missing parameter type for expanded function ((x$1) => x$1.toString)
[error]        val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).head().getMap(14).keySet.toList.map(_.toString)

The df looks like this:

+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
|division_name|  low| call_type|fiscal_year|fiscal_month|  region_name|abandon_rate_percent|answered_calls|connects|equiv_week_calls|equiv_weeks|equivalent_calls|num_customers|offered_calls|                  pv|
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+
|     NATIONAL|PHONE|CABLE CARD|       2016|           1|ALL DIVISIONS|                0.02|         10626|       0|             0.0|        0.0|         10649.8|            0|        10864|Map(subscribers_c...|
|     NATIONAL|PHONE|CABLE CARD|       2016|           1|      CENTRAL|                0.02|          3591|       0|             0.0|        0.0|          3598.6|            0|         3667|Map(subscribers_c...|
+-------------+-----+----------+-----------+------------+-------------+--------------------+--------------+--------+----------------+-----------+----------------+-------------+-------------+--------------------+

The last column, selected on its own, is:

[Map(subscribers_connects -> 5521287, disconnects_hsd -> 7992, subscribers_xfinity home -> 6277491, subscribers_bulk units -> 4978892, connects_cdv -> 41464, connects_disconnects -> 16945, connects_hsd -> 32908, disconnects_internet essentials -> 10319, disconnects_disconnects -> 3506, disconnects_video -> 8960, connects_xfinity home -> 43012)] 

After applying the filter condition and taking just one row from the DataFrame, I want the keys of the last column as a List[String].

4 Answers:

Answer 0 (score: 2)

The type problem is easily solved by explicitly providing the type parameters at the source, i.e. on getMap(14). Since you know you are expecting String -> Long key-value pairs in the Map, just replace getMap(14) with getMap[String, Long](14).
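A minimal sketch of that fix, reusing the question's filter (the index 14 and the String -> Long pairing are carried over from the question, not verified here):

// Explicit type parameters on getMap, so the keys already come out as String
val colNames: List[String] = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
  .head
  .getMap[String, Long](14)
  .keySet
  .toList

With the key type known, the trailing map(_.toString) from the question is no longer needed.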

As for getMap[String, Long](14) coming back as an empty Map, that is down to your data: you simply have an empty map at index 14 of the head row.

More details

In Scala, when you create a List[A], Scala infers the type parameter from whatever information is available.

For example,

// Explicitly provide the type parameter info
scala> val l1: List[Int] = List(1, 2)
// l1: List[Int] = List(1, 2)

// Infer the type parameter by using the arguments passed to List constructor,
scala> val l2 = List(1, 2)
// l2: List[Int] = List(1, 2)

So what happens when you create an empty list?

// Explicitly provide the type parameter info
scala> val l1: List[Int] = List()
// l1: List[Int] = List()

// Infer the type parameter by using the arguments passed to List constructor,
// but surprise, there are no argument since you are creating empty list
scala> val l2 = List()
// l2: List[Nothing] = List()

So when Scala knows nothing at all, it picks the most suitable type it can find, which is the "empty" type Nothing.

The same thing happens when you call toList on other collection objects; it tries to infer the type parameter from the source object.

scala> val ks1 = Map.empty[Int, Int].keySet
// ks1: scala.collection.immutable.Set[Int] = Set()
scala> val l1 = ks1.toList
// l1: List[Int] = List()

scala> val ks2 = Map.empty.keySet
// ks2: scala.collection.immutable.Set[Nothing] = Set()
scala> val l2 = ks2.toList
// l2: List[Nothing] = List()

Similarly, the getMap(14) you call on the head Row of your DataFrame infers the Map's type parameters from the values it finds at index 14 of that Row. So if there is nothing at that index, the returned map is the same as Map.empty, which is a Map[Nothing, Nothing].

Which means that your whole

val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).head.getMap(14).keySet.toList.map(_.toString)

is equivalent to

val colNames = Map.empty.keySet.toList.map(_.toString)

and hence to

scala> val l = List()
// l: List[Nothing] = List()

val colNames = l.map(_.toString)

To sum up, any such colNames can only ever be an empty list.

So there are really two problems here: one is the type issue with List[Nothing], and the other is that the list is empty.

Answer 1 (score: 1)

After the filter, you can just select that column and get the Map, like this:

first().getAs[Map[String, Long]]("pv").keySet
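Spelled out with the question's filter, a sketch of this approach could look like the following (the column name pv and the Long value type are assumptions carried over from the question):

val colNames: List[String] = customerCountDF
  .filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth)
  .first()                          // take the single matching row
  .getAs[Map[String, Long]]("pv")   // read the map column by name instead of by index
  .keySet
  .toList

Because the key type is String here, no map(_.toString) is needed at the end.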

Answer 2 (score: 0)

Since you only ever access a single column (the one at position 14), why not make the developer's job easier (and help whoever supports your code later)?

Try the following:

val colNames = customerCountDF
  .where($"fiscal_year" === maxYear)  // Split one long filter into two
  .where($"fiscal_month" === maxMnth) // where is a SQL-like alias of filter
  .select("pv")                       // Take just the field you need to work with
  .as[Map[String, Long]]              // Map it to the proper type
  .head                               // Load just the single field (all others are left aside)
  .keySet                             // From here on it is just plain Scala

I think the code above states what it does in a clear way (and I believe it should also be the fastest of the proposed solutions, since it loads only the single pv field into a JVM object on the driver).
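One detail worth noting: the .as[Map[String, Long]] step needs an implicit Encoder in scope. In spark-shell the session implicits are already imported; in a standalone application you bring them in yourself. A minimal setup sketch, assuming a session named spark (the app and master names below are placeholders, not part of the original answer):

// Placeholder session setup; in spark-shell `spark` already exists and its implicits are imported.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pv-keys").master("local[*]").getOrCreate()
import spark.implicits._  // supplies the implicit Encoder that .as[Map[String, Long]] relies on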

Answer 3 (score: -1)

A workaround to get the final result as a List[String]. Check this out:

scala> val customerCountDF=Seq((2018,12,Map("subscribers_connects" -> 5521287L, "disconnects_hsd" -> 7992L, "subscribers_xfinity home" -> 6277491L, "subscribers_bulk units" -> 4978892L, "connects_cdv" -> 41464L, "connects_disconnects" -> 16945L, "connects_hsd" -> 32908L, "disconnects_internet essentials" -> 10319L, "disconnects_disconnects" -> 3506L, "disconnects_video" -> 8960L, "connects_xfinity home" -> 43012L))).toDF("fiscal_year","fiscal_month","mapc")
customerCountDF: org.apache.spark.sql.DataFrame = [fiscal_year: int, fiscal_month: int ... 1 more field]

scala> val maxYear =2018
maxYear: Int = 2018

scala> val maxMnth = 12
maxMnth: Int = 12

scala> val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).first.getMap(2).keySet.mkString(",").split(",").toList
colNames: List[String] = List(subscribers_connects, disconnects_hsd, subscribers_xfinity home, subscribers_bulk units, connects_cdv, connects_disconnects, connects_hsd, disconnects_internet essentials, disconnects_disconnects, disconnects_video, connects_xfinity home)

scala>
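A side note on why this workaround compiles: mkString renders the keys (typed as Nothing at compile time, but plain strings at runtime) into one big String, and split rebuilds a List[String] from it, sidestepping the List[Nothing] issue. The comma delimiter would misbehave if a key ever contained a comma, so a separator that cannot appear in key names is safer. A hedged variant (the "|" separator is an assumption, not part of the original answer):

scala> val colNames = customerCountDF.filter($"fiscal_year" === maxYear && $"fiscal_month" === maxMnth).first.getMap(2).keySet.mkString("|").split('|').toList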