How do I create a Map for each row, keyed by column index, in Scala?

Time: 2014-11-15 07:04:24

Tags: scala

I need to build a Map for each row from its columns using Scala. For example, given this input:

sunny,hot,high,FALSE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes

I want the output to be:

RDD[List(
  Map(
    '0 -> 'sunny,
    '1 -> 'hot,
    '2 -> 'high,
    '3 -> 'false,
    '4 -> 'no
  ),
  Map(
    '0 -> 'overcast,
    '1 -> 'hot,
    '2 -> 'high,
    '3 -> 'false,
    '4 -> 'yes
  ),
  Map(
    '0 -> 'rainy,
    '1 -> 'mild,
    '2 -> 'high,
    '3 -> 'false,
    '4 -> 'yes
  )
)]

Here each column becomes one key-value pair: the column number is the key and the column's content is the value.

1 answer:

Answer 0 (score: 6):

Plain Scala

val s = """sunny,hot,high,FALSE,no
          |overcast,hot,high,FALSE,yes
          |rainy,mild,high,FALSE,yes""".stripMargin


s.split("\n").map { line =>
  line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
}.toList

yields:
List(Map(0 -> sunny, 1 -> hot, 2 -> high, 3 -> FALSE, 4 -> no), 
     Map(0 -> overcast, 1 -> hot, 2 -> high, 3 -> FALSE, 4 -> yes), 
     Map(0 -> rainy, 1 -> mild, 2 -> high, 3 -> FALSE, 4 -> yes))

split breaks the text apart at the delimiter.
zipWithIndex maps a Seq to tuples of (value, index).

Seq('a', 'b').zipWithIndex yields Seq[(Char, Int)] = List((a,0), (b,1))
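To make the intermediate values concrete, here is a minimal sketch that walks one row from the question's sample data through the same steps. The names line, columns, indexed and row are illustrative only and do not appear in the original answer; the snippet is written to be pasted into the REPL or run as a script.

```scala
// One input row, taken from the question's sample data.
val line = "sunny,hot,high,FALSE,no"

// split breaks the row into its column values.
val columns: Array[String] = line.split(",")

// zipWithIndex pairs each value with its position, as (value, index).
val indexed: Array[(String, Int)] = columns.zipWithIndex

// Flip each pair to (index, value) and collect the pairs into a Map.
val row: Map[Int, String] = indexed.map { case (word, idx) => idx -> word }.toMap
```

Note that the keys are plain Int values and the values keep their original casing (FALSE, not false), so the result is close to, but not literally, the symbol-style notation used in the question.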

We can tidy this up to:

s.split("\n").map { line =>
  line.split(",").zipWithIndex.map(_.swap).toMap
}.toList

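This works because zipWithIndex produces tuples, and a two-element tuple already has a swap method, so we do not need to swap the elements ourselves.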
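As a small illustration of why the two versions are equivalent, the sketch below applies swap and the hand-written pattern match to the same pair. The names pair, swapped and manual are made up for this example.

```scala
// A (value, index) pair of the kind zipWithIndex produces.
val pair: (String, Int) = ("sunny", 0)

// Tuple2.swap returns a new tuple with the two elements exchanged.
val swapped: (Int, String) = pair.swap

// The hand-written equivalent from the first version of the answer.
val manual: (Int, String) = pair match { case (word, idx) => idx -> word }
```

Both swapped and manual are the tuple (0, "sunny"), so map(_.swap) can replace the explicit case expression without changing the resulting Map.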

For Spark

sc.textFile(<file-with-data>).map { line =>
  line.split(",").zipWithIndex.map(_.swap).toMap
}

Thanks to @Paul