Spark RDD to CSV - adding empty columns

Date: 2015-06-30 12:06:07

Tags: csv apache-spark

I have an RDD[Map[String,Int]] where the keys of the maps are column names. Each map is incomplete, so to know all the column names I would need to union all the keys. Is there a way to avoid that collect operation for gathering the keys, so that a single rdd.saveAsTextFile(..) is enough to produce the csv?

For example, suppose I have an RDD with two elements (Scala notation):

Map("a"->1, "b"->2)
Map("b"->1, "c"->3)

I want to end up with this csv:

a,b,c
1,2,0
0,1,3

A Scala solution is preferred, but any other language compatible with Spark would also be fine.

Edit:

I could also try to solve the problem from the other direction. Let's say I somehow know all the columns up front, but I want to get rid of the columns that have a 0 value in every map. So the problem becomes: I know the keys are ("a", "b", "c"), and from this:

Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)

I need to write the csv:

a,b
1,2
3,1

Is it possible to do this with only one collect?

2 answers:

Answer 0 (score: 2):

If your premise is "every new element in my RDD may bring in a column name I have not seen so far", then the answer is that a full scan obviously cannot be avoided. But you do not need to collect all the elements on the driver.

You can use aggregate to collect only the column names. This method takes two functions: one that merges a single element into the result collection, and one that merges the results coming from two different partitions.

rdd.aggregate(Set.empty[String])(
  { (s, m) => s union m.keySet },
  { (s1, s2) => s1 union s2 }
)

This gives you back a set of all the column names in your RDD. In a second scan you can then write out the CSV file.
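
A minimal sketch of that second pass, assuming a missing key stands for a 0 value; the names colNames, header, lines and the output path are illustrative, not from the original answer:

// Column names collected with aggregate, sorted for a stable column order.
val colNames = rdd
  .aggregate(Set.empty[String])({ (s, m) => s union m.keySet }, { (s1, s2) => s1 union s2 })
  .toSeq
  .sorted

// Second scan: render every map against the full column list, writing 0 for missing keys.
val header = sc.parallelize(Seq(colNames.mkString(",")), 1)
val lines  = rdd.map(m => colNames.map(m.getOrElse(_, 0)).mkString(","))

// Note: saveAsTextFile writes a directory of part files, not a single csv file.
(header ++ lines).saveAsTextFile("out.csv")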

Answer 1 (score: 1):

Scala and any other supported language

You can use spark-csv.

First, let's find all the columns that are present:

val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())

Create RDD[Row]:

import org.apache.spark.sql.Row

val rows = rdd.map { row =>
  Row.fromSeq(cols.value.map(row.getOrElse(_, 0)))
}

Prepare schema:

import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val schema = StructType(
    cols.value.map(field => StructField(field, IntegerType, true)))

Convert RDD[Row] to DataFrame:

val df = sqlContext.createDataFrame(rows, schema)

Write results:

// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")

You can do pretty much the same thing using other supported languages.

Python

If you use Python and the final data fits in driver memory, you can use Pandas through the toPandas() method:

rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())

df = sqlContext.createDataFrame(
    rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))

df.toPandas().to_csv('mycsv.csv', index=False)

or directly:

import pandas as pd 
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)

Edit

One possible way to avoid the second collect is to use accumulators, either to build a set of all the column names, or to count the columns in which you found zeros, and then use this information to map over the rows and either drop the unnecessary columns or add the zeros.

It is possible, but it is inefficient and feels like cheating. The only situation where it makes some sense is when the number of zeros is very low, but I guess that is not the case here.

import org.apache.spark.AccumulatorParam

// Accumulates the set of all column names seen across the RDD
object ColsSetParam extends AccumulatorParam[Set[String]] {

  def zero(initialValue: Set[String]): Set[String] = {
    Set.empty[String]
  }

  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = {
    s1 ++ s2
  }
}

val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach { colSetAccum += _.keys.toSet } 

or

// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))

// Accumulates, per column, the number of rows in which that column is missing
object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {

  def zero(initialValue: Map[String, Int]): Map[String, Int] = {
    Map.empty[String, Int]
  }

  def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
    val keys = m1.keys ++ m2.keys
    keys.map(
      (k: String) => (k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0)))).toMap
  }
}

val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)

rdd.foreach { row =>
  // If allColnames.value -- row.keys.toSet is empty we can avoid this part
  accum += (allColnames.value -- row.keys.toSet).map(x => (x -> 1)).toMap
}
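
Once the foreach action has run, the driver can read the accumulated counts and decide which columns to keep. The following is only an illustrative follow-up (not part of the original answer), assuming a key that is missing from a map stands for a 0 value:

// accum.value maps each column name to the number of rows in which it was missing.
// A column missing from every row contains only zeros and can be dropped.
val rowCount = rdd.count()
val keptCols = allColnames.value.toSeq.sorted.filter { c =>
  accum.value.getOrElse(c, 0).toLong < rowCount
}

// keptCols can then be used in a second pass to render the csv with only these columns.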