How to assign keys to items in a column in Scala?

Asked: 2018-10-06 16:25:49

Tags: scala apache-spark mapreduce rdd

I have the following RDD:

 Col1     Col2
"abc"    "123a"
"def"    "783b"
"abc     "674b"
"xyz"    "123a"
"abc"    "783b"

I need the following output, where each item in each column is converted to a unique key, for example: abc->1, def->2, xyz->3

Col1      Col2
1          1
2          2
1          3
3          1
1          2

Any help would be appreciated. Thanks!

2 answers:

Answer 0 (score: 0)

In this case you can rely on the string's hashCode. For the same input and data type, the hash code will always be the same. Try this:

scala> "abc".hashCode
res23: Int = 96354

scala> "xyz".hashCode
res24: Int = 119193

scala> val df = Seq(("abc","123a"),
     | ("def","783b"),
     | ("abc","674b"),
     | ("xyz","123a"),
     | ("abc","783b")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]

scala>

scala> def hashc(x:String):Int =
     | return x.hashCode
hashc: (x: String)Int

scala> val myudf = udf(hashc(_:String):Int)
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(StringType)))

scala> df.select(myudf('col1), myudf('col2)).show
+---------+---------+
|UDF(col1)|UDF(col2)|
+---------+---------+
|    96354|  1509487|
|    99333|  1694000|
|    96354|  1663279|
|   119193|  1509487|
|    96354|  1694000|
+---------+---------+


scala>

Answer 1 (score: 0)

If the columns must be mapped to natural numbers starting from 1, one approach is to apply zipWithIndex to each column separately, add 1 to the index (since zipWithIndex always starts from 0), convert the individual RDDs to DataFrames, and finally join the converted DataFrames back on the index keys:

val rdd = sc.parallelize(Seq(
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
))

val df1 = rdd.map(_._1).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col1", "c1key")

val df2 = rdd.map(_._2).distinct.zipWithIndex.
  map(r => (r._1, r._2 + 1)).
  toDF("col2", "c2key")

val dfJoined = rdd.toDF("col1", "col2").
  join(df1, Seq("col1")).
  join(df2, Seq("col2"))
// +----+----+-----+-----+
// |col2|col1|c1key|c2key|
// +----+----+-----+-----+
// |783b| abc|    2|    1|
// |783b| def|    3|    1|
// |123a| xyz|    1|    2|
// |123a| abc|    2|    2|
// |674b| abc|    2|    3|
//+----+----+-----+-----+

dfJoined.
  select($"c1key".as("col1"), $"c2key".as("col2")).
  show
// +----+----+
// |col1|col2|
// +----+----+
// |   2|   1|
// |   3|   1|
// |   1|   2|
// |   2|   2|
// |   2|   3|
// +----+----+

Note that if keys starting from 0 are acceptable, the `map(r => (r._1, r._2 + 1))` step can be skipped when generating `df1` and `df2`.
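The 0-based variant can be illustrated with plain Scala collections (a minimal local sketch, not Spark, but `zipWithIndex` behaves the same way, starting at 0):

```scala
// The same rows as in the answer above, as a local Seq.
val rows = Seq(
  ("abc", "123a"),
  ("def", "783b"),
  ("abc", "674b"),
  ("xyz", "123a"),
  ("abc", "783b")
)

// Build value -> key maps per column; zipWithIndex starts at 0,
// so the "+ 1" step is simply dropped.
val c1key = rows.map(_._1).distinct.zipWithIndex.toMap
val c2key = rows.map(_._2).distinct.zipWithIndex.toMap

// Replace each value with its key.
val keyed = rows.map { case (a, b) => (c1key(a), c2key(b)) }
// e.g. ("abc", "123a") becomes (0, 0), ("def", "783b") becomes (1, 1), ...
```

Unlike the RDD version, `distinct` on a local `Seq` preserves first-occurrence order, so the keys here are deterministic; with an RDD the assignment of keys to values may differ between runs, though it remains consistent within one run.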