Mapping each value in an array of a Spark Row

Asked: 2016-09-21 06:32:00

Tags: scala apache-spark dataframe apache-spark-sql

I have a JSON dataset in the following format, one entry per line.


I want to find out who sold the most mangoes, and so on. To do that, I want to load the file into a DataFrame and emit a (product, sales_person_name) key-value pair for every product in each transaction's array. Sample data:

 { "sales_person_name" : "John", "products" : ["apple", "mango", "guava"]}
 { "sales_person_name" : "Tom", "products" : ["mango", "orange"]}
 { "sales_person_name" : "John", "products" : ["apple", "banana"]}
 { "sales_person_name" : "Steve", "products" : ["apple", "mango"]}
 { "sales_person_name" : "Tom", "products" : ["mango", "guava"]}

I can't figure out the right way to explode() row(0) and emit each of its values once together with the row(1) value. Can anyone suggest an approach? Thanks!

Here is what I have so far:

var df = spark.read.json("s3n://sales-data.json")
df.printSchema()
root
 |-- sales_person_name: string (nullable = true)
 |-- products: array (nullable = true)

var nameProductsMap = df.select("sales_person_name",  "products").show()
+-----------------+--------------------+
|sales_person_name|   products         |
+-----------------+--------------------+
|             John|[mango, apple,...   |
|              Tom|[mango, orange,...  |
|             John|[apple, banana...   | 

var resultMap = df.select("products", "sales_person_name")
                  .map(r => (r(1), r(0)))
                  .show()  //This is where I am stuck.

2 Answers:

Answer 0 (score: 5)

import scala.collection.mutable

val exploded = df.explode("products", "product") { a: mutable.WrappedArray[String] => a }
val result = exploded.drop("products")
result.show()

This prints:

+-----------------+-------+
|sales_person_name|product|
+-----------------+-------+
|             John|  apple|
|             John|  mango|
|             John|  guava|
|              Tom|  mango|
|              Tom| orange|
|             John|  apple|
|             John| banana|
|            Steve|  apple|
|            Steve|  mango|
|              Tom|  mango|
|              Tom|  guava|
+-----------------+-------+
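Note that `DataFrame.explode` was deprecated in Spark 2.0 in favor of the `explode` function applied inside `select`. The fan-out it performs, one output row per array element, can be sketched in plain Scala collections (using a hypothetical subset of the sample data, not Spark itself):

```scala
// Each (name, products) record fans out into one (name, product) pair
// per element of the array -- the same shape explode produces.
val rows = Seq(
  ("John", Seq("apple", "mango", "guava")),
  ("Tom",  Seq("mango", "orange"))
)

val exploded = rows.flatMap { case (name, products) =>
  products.map(product => (name, product))
}
// exploded now holds five (name, product) pairs
```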

Answer 1 (score: 1)

Update:

The following code should work:


import org.apache.spark.sql.functions.explode
import scala.collection.mutable

val resultMap = df.select(explode($"products"), $"sales_person_name")

def counter(l: TraversableOnce[Any]) = {
  val temp = mutable.Map[Any, Int]()
  for (i <- l) {
    if (temp.contains(i)) temp(i) += 1
    else temp(i) = 1
  }
  temp
}

resultMap.map(x => (x(0), Array(x(1)))).
  reduceByKey(_ ++ _).
  map { case (x, y) => (x, counter(y).toArray) }
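To answer the original question (who sold the most mangoes), the exploded (product, name) pairs only need to be grouped and counted; in DataFrame terms that would be a `groupBy("product", "sales_person_name").count()`. A plain-Scala sketch of the same logic over the question's sample data:

```scala
// Sample data from the question, as plain Scala collections.
val sales = Seq(
  ("John",  Seq("apple", "mango", "guava")),
  ("Tom",   Seq("mango", "orange")),
  ("John",  Seq("apple", "banana")),
  ("Steve", Seq("apple", "mango")),
  ("Tom",   Seq("mango", "guava"))
)

// Explode into (product, name) pairs, keep only mango sales,
// then count sales per person and take the maximum.
val mangoCounts = sales
  .flatMap { case (name, products) => products.map(p => (p, name)) }
  .filter { case (product, _) => product == "mango" }
  .groupBy { case (_, name) => name }
  .map { case (name, pairs) => (name, pairs.size) }

val topSeller = mangoCounts.maxBy { case (_, count) => count }
// Tom sold mangoes in two transactions, more than John or Steve.
```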