Parsing CSV into a DataFrame / DataSet with Apache Spark and Java

Asked: 2014-08-18 12:07:52

Tags: java apache-spark hadoop apache-spark-sql hdfs

I am new to Spark, and I would like to use group-by & reduce to find the following from a CSV (one row per employee):

  Department, Designation, costToCompany, State
  Sales, Trainee, 12000, UP
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, TN
  Sales, Lead, 32000, AP
  Sales, Lead, 32000, TN 
  Sales, Lead, 32000, LA
  Sales, Lead, 32000, LA
  Marketing, Associate, 18000, TN
  Marketing, Associate, 18000, TN
  HR, Manager, 58000, TN

I would like to aggregate the above CSV by Department, Designation, State, with the additional columns sum(costToCompany) and TotalEmployeeCount.

The result should look like:

  Dept, Desg, state, empCount, totalCost
  Sales,Lead,AP,2,64000
  Sales,Lead,LA,3,96000  
  Sales,Lead,TN,2,64000

Is there a way to achieve this using transformations and actions, or should we go for RDD operations?

4 answers:

Answer 0 (score: 39)

Procedure

  • Create a class (schema) to encapsulate your structure (it is not required for approach B, but it will make your code easier to read if you use Java)

    import java.io.Serializable;

    public class Record implements Serializable {
      String department;
      String designation;
      long costToCompany;
      String state;
      // constructor, getters and setters
    }
    
  • Load the CSV (or JSON) file

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.sql.SQLContext;

    JavaSparkContext sc; // an already initialised Spark context
    JavaRDD<String> data = sc.textFile("path/input.csv");

    //JavaSQLContext sqlContext = new JavaSQLContext(sc); // For previous versions
    SQLContext sqlContext = new SQLContext(sc); // In Spark 1.3 the Java API and Scala API have been unified

    JavaRDD<Record> rdd_records = data.map(
        new Function<String, Record>() {
            public Record call(String line) throws Exception {
                // Here you could parse JSON instead:
                // Gson gson = new Gson();
                // return gson.fromJson(line, Record.class);
                String[] fields = line.split(",");
                return new Record(fields[0], fields[1],
                    Long.parseLong(fields[2].trim()), fields[3]);
            }
        });
    

At this point you have two approaches:

A. SparkSQL

  • Register the table (using the schema class you defined)

    // Note: JavaSchemaRDD / registerAsTable are the pre-1.3 API; with Spark 1.3+ this would be:
    //   DataFrame table = sqlContext.createDataFrame(rdd_records, Record.class);
    //   table.registerTempTable("record_table");
    JavaSchemaRDD table = sqlContext.applySchema(rdd_records, Record.class);
    table.registerAsTable("record_table");
    table.printSchema();
    
  • Query with the desired group-by

    // Query the table
    JavaSchemaRDD res = sqlContext.sql(
        "select department, designation, state, sum(costToCompany), count(*) "
      + "from record_table "
      + "group by department, designation, state");
    
  • Here you can also run any other queries you want using the SQL approach
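
For instance, a filter on the same registered table (the 30000 threshold is just an illustration):

    JavaSchemaRDD highPaid = sqlContext.sql(
        "select department, designation, state, costToCompany "
      + "from record_table "
      + "where costToCompany > 30000");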

B. Spark

  • Map using a composite key: Department, Designation, State

    JavaPairRDD<String, Tuple2<Long, Integer>> records_JPRDD =
      rdd_records.mapToPair(new PairFunction<Record, String, Tuple2<Long, Integer>>() {
        public Tuple2<String, Tuple2<Long, Integer>> call(Record record) {
          // composite key: Department + Designation + State
          return new Tuple2<String, Tuple2<Long, Integer>>(
              record.department + record.designation + record.state,
              new Tuple2<Long, Integer>(record.costToCompany, 1));
        }
      });

  • Use reduceByKey on the composite key, summing the costToCompany column and accumulating the number of records per key

    JavaPairRDD<String, Tuple2<Long, Integer>> final_rdd_records =
      records_JPRDD.reduceByKey(
        new Function2<Tuple2<Long, Integer>, Tuple2<Long, Integer>, Tuple2<Long, Integer>>() {
          public Tuple2<Long, Integer> call(Tuple2<Long, Integer> v1,
              Tuple2<Long, Integer> v2) throws Exception {
            // sum the costs and add up the employee counts
            return new Tuple2<Long, Integer>(v1._1 + v2._1, v1._2 + v2._2);
          }
        });
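
For example, to inspect the aggregated pairs you could collect them on the driver (only advisable for small results like this one):

    for (Tuple2<String, Tuple2<Long, Integer>> entry : final_rdd_records.collect()) {
        // key = Department+Designation+State, value = (totalCost, employeeCount)
        System.out.println(entry._1 + " -> cost=" + entry._2._1 + ", count=" + entry._2._2);
    }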
    

Answer 1 (score: 18)

  

CSV files can be parsed with Spark's built-in CSV reader. It returns a DataFrame / DataSet on successfully reading the file, and on top of that DataFrame / DataSet you can easily apply SQL-like operations.

Using Spark 2.x (and above) with Java

Create a SparkSession object, aka spark

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL Example")
    .getOrCreate();

Create a schema for the rows using StructType
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
    .add("department", "string")
    .add("designation", "string")
    .add("ctc", "long")
    .add("state", "string");

Create a dataframe from the CSV file and apply the schema to it

Dataset<Row> df = spark.read()
    .option("mode", "DROPMALFORMED")
    .schema(schema)
    .csv("hdfs://path/input.csv");

More options for reading data from a CSV file
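
For instance, a few commonly used reader options of the Spark 2.x CSV source (whether you need them depends on your file):

Dataset<Row> dfWithOptions = spark.read()
    .option("header", "true")      // the first line holds the column names
    .option("delimiter", ",")      // field separator
    .option("inferSchema", "true") // let Spark guess the column types instead of passing a StructType
    .csv("hdfs://path/input.csv");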

Now we can aggregate the data in 2 ways

1. SQL way

Register a table in the Spark SQL metastore to perform SQL operations

df.createOrReplaceTempView("employee");

Run an SQL query on the registered dataframe

Dataset<Row> sqlResult = spark.sql(
    "SELECT department, designation, state, SUM(ctc), COUNT(department)" 
        + " FROM employee GROUP BY department, designation, state");

sqlResult.show(); //for testing

We can even execute SQL directly on the CSV file, without creating a table, using Spark SQL
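
For instance, Spark 2.x can query the file as a table directly; since this CSV has no header line, the columns come back as _c0, _c1, ... (a small sketch, independent of the schema defined above):

Dataset<Row> direct = spark.sql(
    "SELECT _c0 AS department, _c1 AS designation, _c3 AS state, _c2 AS ctc"
        + " FROM csv.`hdfs://path/input.csv`");
direct.show();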

  

2. Object chaining, or programmatic, or Java-like way

Do the necessary imports for the SQL functions

import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.sum;

Use groupBy and agg on the dataframe/dataset to perform count and sum on the data

Dataset<Row> dfResult = df.groupBy("department", "designation", "state")
    .agg(sum("ctc"), count("department"));
// After Spark 1.6 columns mentioned in group by will be added to result by default

dfResult.show();//for testing

Dependent libraries

"org.apache.spark" % "spark-core_2.11" % "2.0.0" 
"org.apache.spark" % "spark-sql_2.11" % "2.0.0"

Answer 2 (score: 4)

The following might not be entirely correct, but it should give you an idea of how to juggle the data. It's not pretty and should be replaced with case classes etc., but as a quick example of how to use the spark api, I hope it's enough :)

val rawlines = sc.textFile("hdfs://.../*.csv")

case class Employee(dep: String, des: String, cost: Double, state: String)

val employees = rawlines
  .map(_.split(","))  // or use a proper CSV parser
  .map(row => Employee(row(0), row(1), row(2).trim.toDouble, row(3)))

// the 1 is the amount of employees (which is obviously 1 per line)
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))

val results = keyVals.reduceByKey { (a, b) =>
  (a._1 + b._1, a._2 + b._2) // (a.count + b.count, a.cost + b.cost)
}

// debug output
results.take(100).foreach(println)

results
  .map(keyval => someThingToFormatAsCsvStringOrWhatever)
  .saveAsTextFile("hdfs://.../results")

Or you can use SparkSQL:

val sqlContext = new SQLContext(sparkContext)

// implicitly convert an RDD of case classes to a SchemaRDD (Spark 1.0/1.1 API)
import sqlContext.createSchemaRDD

// case classes can easily be registered as tables
employees.registerAsTable("employees")

val results = sqlContext.sql("""select dep, des, state, sum(cost), count(*)
  from employees
  group by dep, des, state""")

Answer 3 (score: 4)

For JSON, if your text files contain one JSON object per line, you can use sqlContext.jsonFile(path) to let Spark SQL load them as a SchemaRDD (the schema will be inferred automatically). Then you can register it as a table and query it with SQL. You can also manually load a text file as an RDD[String] containing one JSON object per record and use sqlContext.jsonRDD(rdd) to turn it into a SchemaRDD. jsonRDD is useful when you need to pre-process your data.
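
A minimal Java sketch of both paths, assuming the Spark 1.3-era SQLContext (where these calls return a DataFrame rather than the older SchemaRDD); the file path "path/records.json" and the column names are placeholders:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc); // sc: an existing JavaSparkContext

// One JSON object per line; the schema is inferred automatically
DataFrame records = sqlContext.jsonFile("path/records.json");
records.registerTempTable("records");
sqlContext.sql("select department, sum(costToCompany) from records group by department").show();

// Or load the lines yourself, pre-process them, then convert with jsonRDD
JavaRDD<String> rawJson = sc.textFile("path/records.json");
DataFrame fromRdd = sqlContext.jsonRDD(rawJson);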