Question

I have a JSON file with two fields of interest: productId and quantity. This is how I am reading it:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val baseDF = sqlContext.read.json(fileFullPath)

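I want to turn this into a Spark RDD or DataFrame with two columns, productId and quantity, but with multiple rows per product based on the quantity: one row for each unit of quantity.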

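In my sample data, product 1 should get 10 rows, product 2 should get 1, product 3 should get 3, and product 4 should get 5, for a total of 19 rows, i.e. #rows = sum(quantity).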

Any help is appreciated. I am using Spark 1.6.2 and Scala.

Answer 1

This should do the trick:

import org.apache.spark.sql.functions._

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

val baseDF = sqlContext.read.json(fileFullPath)
// Build a list with one entry per unit of quantity, so that explode yields that many rows
val listFromQuantity = udf { quantity: Int => List.fill(quantity)(quantity) }

baseDF
  .select(explode($"sales.sale")) // one row per sale; the exploded struct column is named "col"
  .select($"col.productId", explode(listFromQuantity($"col.quantity")).as("quantity"))
  .show()

This returns:

+---------+--------+
|productId|quantity|
+---------+--------+
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        1|      10|
|        2|       1|
|        3|       3|
|        3|       3|
|        3|       3|
|        4|       5|
|        4|       5|
|        4|       5|
|        4|       5|
|        4|       5|
+---------+--------+

If you want the second column to hold a single unit (for example, the value 1 instead of 5), replace List.fill(quantity)(quantity) with List.fill(quantity)(1).
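To see what the UDF plus explode combination does without starting Spark, here is a minimal sketch using plain Scala collections. The `Sale` case class and the `ExpandByQuantity` object are hypothetical names made up for this illustration, and the sample values are simply the product/quantity pairs from the output above; `flatMap` stands in for `explode` here.

```scala
// Hypothetical stand-in for one element of the sales.sale array
case class Sale(productId: Long, quantity: Int)

object ExpandByQuantity {
  // Same idea as the UDF followed by explode: emit `quantity` copies of
  // each sale, each copy carrying a quantity of 1
  def expand(sales: Seq[Sale]): Seq[Sale] =
    sales.flatMap(s => List.fill(s.quantity)(Sale(s.productId, 1)))

  def main(args: Array[String]): Unit = {
    // Product/quantity pairs taken from the output table in this answer
    val sales = Seq(Sale(1, 10), Sale(2, 1), Sale(3, 3), Sale(4, 5))
    val rows = expand(sales)
    rows.foreach(r => println(s"${r.productId}\t${r.quantity}"))
    println(s"rows = ${rows.size}") // 10 + 1 + 3 + 5 = 19
  }
}
```

The row count equals the sum of the quantities, which matches the #rows = sum(quantity) requirement in the question.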
