How to create a DataFrame from a CSV in Spark (using Scala) when the first line is the schema?

Time: 2015-07-14 04:46:55

Tags: scala csv apache-spark hdfs dataframe

I am new to Spark and I am coding in Scala. I want to read a file from HDFS or S3 and convert it into a Spark DataFrame. The first line of the CSV file is the schema, but how do I create a DataFrame when the columns are not known in advance? I used the following kind of code to create a DataFrame for a known schema.

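A minimal sketch of that known-schema code, assuming Spark 1.x with SQLContext (the file path and column names below are placeholders, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

// Sketch only: the schema is written out by hand, so this works only
// when the columns are known in advance.
val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("knownSchema"))
val sqlContext = new SQLContext(sc)

// Hypothetical two-column layout.
val schema = StructType(Seq(
  StructField("Id", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true)))

val rows = sc.textFile("/known.csv") // placeholder path
  .map(_.split(","))
  .map(t => Row(t(0).toInt, t(1)))

val df = sqlContext.createDataFrame(rows, schema)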

1 Answer:

Answer 0: (score: 4)

Dear hammad, you can try the following code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

object CsvToDataFrame {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
  val sqlcon = new SQLContext(sc)

  def main(args: Array[String]) {
    // Comma-separated list of columnName:type
    val schemaString = "Id:int,FirstName:text,LastName:text,Email:string,Country:text"
    val schema = StructType(
      schemaString.split(",").map { field =>
        val Array(name, ftype) = field.split(":")
        StructField(name, getFieldTypeInSchema(ftype), nullable = true)
      })

    val rdd = sc.textFile("/users.csv")

    // Skip the header: drop the first line of the first partition only.
    val noHeader = rdd.mapPartitionsWithIndex((i, iterator) =>
      if (i == 0 && iterator.hasNext) {
        iterator.next()
        iterator
      } else iterator)

    // Turn each CSV line into a Row, casting every token to the type
    // declared for its column in the schema.
    val rowRDDx = noHeader.map { line =>
      val tokens = line.split(",")
      val values = tokens.zipWithIndex.map { case (value, index) =>
        schema.fields(index).dataType match {
          case IntegerType   => value.toInt
          case DoubleType    => value.toDouble
          case LongType      => value.toLong
          case FloatType     => value.toFloat
          case ByteType      => value.toByte
          case BooleanType   => value.toBoolean
          case TimestampType => java.sql.Timestamp.valueOf(value)
          case _             => value // StringType and anything unrecognized
        }
      }
      Row.fromSeq(values)
    }

    // applySchema was deprecated in Spark 1.3; createDataFrame replaces it.
    val df = sqlcon.createDataFrame(rowRDDx, schema)
  }

  // Map a type name from the schema string to a Spark SQL DataType.
  def getFieldTypeInSchema(ftype: String): DataType = ftype match {
    case "int"       => IntegerType
    case "double"    => DoubleType
    case "long"      => LongType
    case "float"     => FloatType
    case "byte"      => ByteType
    case "string"    => StringType
    case "date"      => TimestampType
    case "timestamp" => TimestampType // was StringType, which loses the type
    case "uuid"      => StringType
    case "decimal"   => DoubleType
    case "boolean"   => BooleanType
    case "counter"   => LongType // Cassandra counters are 64-bit
    case "bigint"    => LongType // was IntegerType; bigint is 64-bit
    case "text"      => StringType
    case "ascii"     => StringType
    case "varchar"   => StringType
    case "varint"    => IntegerType
    case _           => StringType
  }
}
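For illustration, suppose /users.csv held the rows below (made-up data; the first line is the header that the code above skips):

Id,FirstName,LastName,Email,Country
1,Hammad,Khan,hammad@example.com,PK

With the hard-coded schemaString, df.printSchema() would then report:

root
 |-- Id: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Country: string (nullable = true)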

Hope it will help you. :)