Spark DataFrames with a for loop: optimization techniques

Time: 2020-05-17 16:25:16

Tags: scala apache-spark apache-spark-sql

I am trying to implement the logic below.

    1. Take some records from one table.
    2. Loop over the resulting data.
    3. Inside the loop, read data from other tables into two different DataFrames.
    4. Join these two DataFrames and load the result into a third table.

    // Step 1: read the driving records.
    val id_chck1 = "select distinct id, id1, id2 from table WHERE type = 'N'"
    val id_chck = hive.executeQuery(id_chck1)
    // Collect to the driver so the queries below can be issued per row.
    for (data <- id_chck.collect) {
      val id = data(0)
      val id1 = data(1)
      val id2 = data(2)

      val values_1 = "select distinct bill, bil_num, id_num, bill_date, process_date from table l WHERE id2 = '222'"
      val values_1_data = hive.executeQuery(values_1)
      for (row <- values_1_data.collect) {
        // Split the row once instead of rebuilding the string for every field.
        val fields = row.mkString(",").split(",")
        val bill = fields(0)
        val bil_num = fields(1)
        val id_num = fields(2)
        val bill_date = fields(3)

        val df1 = "select column name from tablename where id=222"
        val df1_data = hive.executeQuery(df1)
        val df2 = "select column name from tablename2 where id=222"
        val df2_data = hive.executeQuery(df2)

        // The original only says "joining df1 and df2"; the join key "id" is an assumption.
        val df3 = df1_data.join(df2_data, "id")
        df3.write.format("orc").mode("Append").save("hdfslocation")
      }
      // Load the files written by the inner loop into the target table.
      val load1 = "load data inpath 'hdfslocation' into table tablename"
      val load1_data = hive.executeUpdate(load1)
    }

However, this process takes more than 6 hours. Is there another way to do the same thing so that it finishes in less time and performs better? I have 50 million records in the test1 table.

1 Answer:

Answer 0: (score: 0)

Could you please add sample input and expected output as an example? It is hard to see exactly what you are trying to achieve.
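Without sample data this can only be a rough sketch, but the usual direction for this kind of problem is to replace the driver-side loops with a single set-based join, so that Spark runs one distributed job instead of one query per row. The example below uses small in-memory stand-ins for the tables. The names `driving`, `left`, `right`, `left_col`, `right_col`, the join key `id`, and the output path are all hypothetical and not taken from the question, so treat it as an illustration of the technique rather than a drop-in replacement.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LoopToJoinSketch {

  // Replaces "for each id: query left, query right, join, append" with one
  // set-based plan. All column names here are assumptions for illustration.
  def joinForKeys(driving: DataFrame, left: DataFrame, right: DataFrame): DataFrame = {
    // The ids the outer loop would have iterated over.
    val keys = driving.filter(col("type") === "N").select("id").distinct()

    left
      .join(keys, Seq("id"), "left_semi") // keep only rows whose id is in keys
      .join(right, Seq("id"), "inner")    // the per-id join, done once for all ids
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("loop-to-join-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the Hive tables in the question.
    val driving = Seq((1, "a", "x", "N"), (2, "b", "y", "N"), (3, "c", "z", "Y"))
      .toDF("id", "id1", "id2", "type")
    val left = Seq((1, "l1"), (2, "l2"), (3, "l3")).toDF("id", "left_col")
    val right = Seq((1, "r1"), (2, "r2"), (4, "r4")).toDF("id", "right_col")

    val result = joinForKeys(driving, left, right)
    result.show()

    // One write for the whole result instead of one append per loop iteration.
    // The path is a placeholder.
    result.write.format("orc").mode("append").save("/tmp/loop_to_join_sketch_output")

    spark.stop()
  }
}
```

The point of the design is that nothing is collected to the driver: the filter, both joins and the write stay inside one Spark plan, which can then be parallelized across executors. Whether this matches the intended result depends on the real join keys and filters, which the question does not show.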
