Question

Spark newbie here, trying to use Spark to do some ETL, and I am having trouble finding a clean way to unify the data into the destination schema.

I have multiple dataframes in a Spark context (streaming) with these keys/values:

Dataframe of long values:

entry---------|long---------
----------------------------
alert_refresh |1446668689989
assigned_on   |1446668689777

Dataframe of string values:

entry---------|string-------
----------------------------
statusmsg     |alert msg
url           |http:/svr/pth

Dataframe of boolean values:

entry---------|boolean------
----------------------------
led_1         |true
led_2         |true

Dataframe of integer values:

entry---------|int----------
----------------------------
id            |789456123

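I need to create a unified dataframe from these, where each key becomes a field name and each field keeps the type of its source dataframe. It would look something like this: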

id-------|led_1|led_2|statusmsg|url----------|alert_refresh|assigned_on
-----------------------------------------------------------------------
789456123|true |true |alert msg|http:/svr/pth|1446668689989|1446668689777

What is the most efficient way to do this in Spark?

BTW - I tried doing a matrix transform:

val seq_b= df_booleans.flatMap(row => (row.toSeq.map(col => (col, row.toSeq.indexOf(col))))) 
 .map(v => (v._2, v._1)) 
 .groupByKey.sortByKey(true) 
 .map(_._2)

val b_schema_names = seq_b.first.flatMap(r => Array(r)) 
val b_schema = StructType(b_schema_names.map(r => StructField(r.toString(), BooleanType, true)))
val b_data = seq_b.zipWithIndex.filter(_._2 == 1).map(_._1).first()
val boolean_df = sparkContext.createDataFrame(b_data, b_schema)

Issue: it takes 12 seconds, and .sortByKey(true) does not always sort the values last.
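
One possible direction, offered as a minimal sketch rather than a tested answer for the streaming case: pivot each key/value dataframe into a single wide row, then cross join the rows. It assumes Spark 2.1 or later (`SparkSession`, `crossJoin`), batch dataframes, unique keys within each frame, and a value column named `value` in every frame. The object name `UnifyKeyValueFrames` and the helpers `toWideRow` and `unify` are hypothetical names made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

// Hypothetical helper object, not part of Spark.
object UnifyKeyValueFrames {

  // Turn a two-column (entry, value) frame into one row with one column per entry.
  // Assumption: the columns are named "entry" and "value".
  def toWideRow(kv: DataFrame): DataFrame =
    kv.groupBy().pivot("entry").agg(first("value"))

  // Combine the single-row frames side by side, in the order given.
  def unify(frames: Seq[DataFrame]): DataFrame =
    frames.map(toWideRow).reduce(_ crossJoin _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kv-to-wide")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the tables in the question.
    val longs = Seq(("alert_refresh", 1446668689989L), ("assigned_on", 1446668689777L))
      .toDF("entry", "value")
    val strings = Seq(("statusmsg", "alert msg"), ("url", "http:/svr/pth"))
      .toDF("entry", "value")
    val booleans = Seq(("led_1", true), ("led_2", true)).toDF("entry", "value")
    val ints = Seq(("id", 789456123)).toDF("entry", "value")

    val unified = unify(Seq(ints, booleans, strings, longs))

    unified.printSchema() // inspect the column types of the unified frame
    unified.show(false)

    spark.stop()
  }
}
```

Because `first` returns the same data type as the column it aggregates, each pivoted column should keep the type of its source frame. Calling `pivot` without an explicit list of values makes Spark run an extra job to collect the distinct keys, so passing the known keys as the second argument of `pivot` may help with latency.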

SPARK: What is the most efficient way to take a KV pair and turn it into a typed dataframe

0 answers: