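Factorize a Spark column

Time: 2016-09-28 09:17:23

Tags: scala apache-spark spark-dataframe

Is it possible to factorize a Spark DataFrame column? By factorizing I mean creating a mapping from each unique value in the column to the same ID.

Example, the original DataFrame: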

+----------+----------------+--------------------+
|      col1|            col2|                col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370|                   A|
|1473492972|4060600988513370|                   A|
|1473509764|4060600988513370|                   B|
|1473513432|4060600988513370|                   C|
|1473513432|4060600988513370|                   A|
+----------+----------------+--------------------+

to the factorized version:

+----------+----------------+--------------------+
|      col1|            col2|                col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370|                   0|
|1473492972|4060600988513370|                   0|
|1473509764|4060600988513370|                   1|
|1473513432|4060600988513370|                   2|
|1473513432|4060600988513370|                   0|
+----------+----------------+--------------------+

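In plain Scala this would be fairly simple, but since Spark distributes its DataFrames across nodes, I don't know how to keep the mapping A->0, B->1, C->2 consistent.

Also, assume the DataFrame is very large (gigabytes), so it may not be possible to load an entire column into the memory of a single machine.

Can it be done?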

2 Answers:

Answer 0: (score: 3)

You can use StringIndexer to encode the letters as indices:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")

val indexed = indexer.fit(df).transform(df)
indexed.show()

+----------+----------------+----+---------+
|      col1|            col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370|   A|      0.0|
|1473492972|4060600988513370|   A|      0.0|
|1473509764|4060600988513370|   B|      1.0|
|1473513432|4060600988513370|   C|      2.0|
|1473513432|4060600988513370|   A|      0.0|
+----------+----------------+----+---------+

Data:

val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A")
)).toDF("col1", "col2", "col3")
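
As a follow-up sketch (not part of the original answer): StringIndexer writes doubles to a new column, whereas the question asks for integer IDs in col3 itself. The snippet below casts the index, overwrites col3, and reads the label-to-ID mapping back from the fitted model. The names model, factorized and mapping are illustrative, the local SparkSession is only there to make the snippet self-contained, and labels is deprecated in Spark 3.x in favor of labelsArray.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("factorize").getOrCreate()
import spark.implicits._

val df = Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A")
).toDF("col1", "col2", "col3")

// Fit once; the fitted model holds the label -> index mapping.
val model = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")
  .fit(df)

// Overwrite col3 with the integer index and drop the helper column.
val factorized = model.transform(df)
  .withColumn("col3", col("col3Index").cast("int"))
  .drop("col3Index")

// The position of each label in `labels` is its assigned ID.
val mapping: Map[String, Int] = model.labels.zipWithIndex.toMap

factorized.show()
```

By default the IDs follow descending label frequency, so A (the most frequent value) gets 0; how the tie between B and C is broken may depend on the Spark version. On Spark 2.3 or later, setStringOrderType("alphabetAsc") should give an alphabetical assignment instead.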

Answer 1: (score: 0)

You can use a user-defined function (UDF).

First, you need to create the mapping you want:

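import org.apache.spark.sql.functions.udf

// Hard-coded mapping from each known value to its ID
val updateFunction = udf { (x: String) =>
  x match {
    case "A" => 0
    case "B" => 1
    case "C" => 2
    case _   => 3 // any other value (including null) falls back to 3
  }
}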

Now you just need to apply it to the DataFrame:

df.withColumn("col3", updateFunction(df.col("col3")))
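
The hard-coded match only works when the values are known in advance. Below is a minimal sketch of building the mapping from the data instead, assuming the number of distinct values in col3 is small enough to collect to the driver (only the distinct values are collected, never the whole column). The names labels, mapping, toId and factorized are illustrative, and -1 is an arbitrary choice for values that are not in the mapping.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("factorize-udf").getOrCreate()
import spark.implicits._

val df = Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A")
).toDF("col1", "col2", "col3")

// Collect only the distinct non-null values and sort them so the IDs are deterministic.
val labels: Array[String] = df
  .select("col3")
  .where(col("col3").isNotNull)
  .distinct()
  .collect()
  .map(_.getString(0))
  .sorted

// Value -> ID lookup; it is captured by the UDF closure and shipped to the executors.
val mapping: Map[String, Int] = labels.zipWithIndex.toMap

// Unknown or null values map to -1.
val toId = udf((x: String) => mapping.getOrElse(x, -1))

val factorized = df.withColumn("col3", toId(col("col3")))
factorized.show()
```

If the set of distinct values is itself large, wrapping the map in spark.sparkContext.broadcast, or joining against a small lookup DataFrame, may be a better fit than a plain closure.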