How to load a CSV with nested columns using Apache Spark

Date: 2018-01-02 22:13:06

Tags: csv apache-spark apache-spark-sql spark-dataframe

I have a CSV file:

name,age,phonenumbers
Tom,20,"[{number:100200, area_code:555},{number:100300, area_code:444}]"
Harry,20,"[{number:100400, area_code:555},{number:100500, area_code:666}]"

How can I load this file in Spark into an RDD/Dataset of Person objects:

class Person {
    String name;
    Integer age;
    List<Phone> phonenumbers;

    class Phone {
        int number;
        int area_code; 
    }
}

1 Answer:

Answer 0 (score: 1)

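(For the snippets below, assume the CSV has already been read into a DataFrame named input, along these lines; the file path is a placeholder:)

val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

// Quoted fields keep the embedded commas of the phone arrays intact
val input = spark.read
  .option("header", "true")
  .csv("people.csv")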
Unfortunately, the column names of the nested objects are not quoted in your example. Is that really the case? Because if they did have quotes (i.e. well-formed JSON), then you could very easily use the from_json function, like this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// The phonenumbers column holds an array of {number, area_code} structs
val schema = new ArrayType(new StructType()
  .add("number", IntegerType)
  .add("area_code", IntegerType), false)

// Parse the JSON strings into a proper nested column
val converted = input.withColumn("phones", from_json('phonenumbers, schema))

If that is not the case, then you would need your own logic to convert the strings into actual nested objects, for example a UDF that adds the missing quotes before parsing, as sketched below.
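A minimal sketch of that approach; the quoteKeys helper is hypothetical and assumes the keys are plain word characters:

import org.apache.spark.sql.functions.udf

// Hypothetical helper: wrap bare keys in quotes (number: -> "number":)
// so the strings become well-formed JSON before from_json runs
val quoteKeys = udf { s: String =>
  if (s == null) null else s.replaceAll("(\\w+)\\s*:", "\"$1\":")
}

val wellFormed = input.withColumn("phonenumbers", quoteKeys('phonenumbers))
val parsed = wellFormed.withColumn("phones", from_json('phonenumbers, schema))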
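Either way, once the phones column is parsed (converted above, or parsed in the UDF variant), you can map the result onto a typed Dataset. A sketch assuming Scala case-class equivalents of the Java classes in the question:

import spark.implicits._

// Scala counterparts of the question's Person/Phone classes
case class Phone(number: Int, area_code: Int)
case class Person(name: String, age: Int, phonenumbers: Seq[Phone])

val people = converted
  .select('name, 'age.cast("int").as("age"), 'phones.as("phonenumbers"))
  .as[Person]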