Writing Parquet files in append mode - the last task runs continuously

Date: 2018-08-09 22:28:59

Tags: java apache-spark hadoop left-join parquet

I am joining three dataframes and writing the result to a Parquet file in append mode. I see the last task running continuously, and it seems to never finish, even though all tasks read almost the same amount of data.

I have also used repartitioning to avoid data skew, but with no luck.
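
For reference, this is what per-row salting typically looks like (a minimal sketch assuming Spark's built-in rand(); df and the column name are illustrative, not from my job). Note that rand() is evaluated per row, whereas wrapping a driver-side random number in lit() puts the same constant on every row:

    import static org.apache.spark.sql.functions.*;

    // Per-row salt in [0, 100): spreads hot keys across partitions.
    // lit(<driver-side random>) would instead assign ONE value to all rows.
    DataFrame salted = df.withColumn("salt", floor(rand().multiply(100)));
    DataFrame spread = salted.repartition(100, salted.col("salt"));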

Please suggest what I can do to improve this.

Configuration details

spark.yarn.executor.memoryOverhead: 8000

spark.shuffle.blockTransferService: netty

spark.scheduler.mode:   FAIR

spark.port.maxRetries:  200

spark.network.timeout:  800

spark.executor.memory:  12g

spark.executor.instances:   40

spark.executor.extraJavaOptions:    -XX:+UseG1GC -XX:SurvivorRatio=16 -XX:MaxPermSize=128m

spark.executor.cores:   4

Number of nodes in cluster: 48 (each node has around 240 GB of RAM)
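
For completeness, the same settings can also be applied programmatically, as sketched below (they are equally valid as --conf arguments to spark-submit; the app name is hypothetical):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Sketch: mirror the configuration listed above in code.
    SparkConf conf = new SparkConf()
        .setAppName("parquet-append-join")   // hypothetical name
        .set("spark.yarn.executor.memoryOverhead", "8000")
        .set("spark.shuffle.blockTransferService", "netty")
        .set("spark.scheduler.mode", "FAIR")
        .set("spark.port.maxRetries", "200")
        .set("spark.network.timeout", "800")
        .set("spark.executor.memory", "12g")
        .set("spark.executor.instances", "40")
        .set("spark.executor.extraJavaOptions",
             "-XX:+UseG1GC -XX:SurvivorRatio=16 -XX:MaxPermSize=128m")
        .set("spark.executor.cores", "4");
    JavaSparkContext ctx = new JavaSparkContext(conf);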

Code

    //Set sql context
    ctx.hadoopConfiguration().set("parquet.enable.summary-metadata", "false");
    SQLContext sqlContext = new org.apache.spark.sql.SQLContext(ctx);

    //Data frame for BASE-PARENT data
    DataFrame df_parent_base = sqlContext.read()
        .format("com.databricks.spark.avro")
        .load("/data/prod/fin/base/source/stream/ing_year={" + context.v_ing_year + "}"
            + "/ing_month={" + context.v_ing_month + "}"
            + "/ing_day={" + context.v_ing_day + "}"
            + "/ing_hour={" + context.v_ing_hour + "}/*.avro")
        .select("event_day", "event_hour", "event_month", "event_year",
            "ing_year", "ing_month", "ing_day", "ing_hour", "job_instance_id")
        .distinct()
        .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER());

    //Collecting data to get the "filter" for Access Data
    Row[] df_parent_base_collect = df_parent_base
        .select("job_instance_id", "event_year", "event_month", "event_day", "event_hour")
        .collect();

    String jobinstanceid="";
    String event_year="";
    String event_month="";
    String event_day="";
    String event_hour="";

    HashSet<String> jobinstanceidset = new HashSet<String>();
    HashSet<Integer> yearset = new HashSet<Integer>();
    HashSet<Integer> monthset = new HashSet<Integer>();
    HashSet<Integer> dayset = new HashSet<Integer>();
    HashSet<Integer> hourset = new HashSet<Integer>();

    for (Row t : df_parent_base_collect) {
        jobinstanceidset.add(t.getString(0));
        yearset.add(t.getInt(1));
        monthset.add(t.getInt(2));
        dayset.add(t.getInt(3));
        hourset.add(t.getInt(4));
    }

    for (String val : jobinstanceidset) {
        jobinstanceid += "," + val;
    }

    for (Integer val : yearset) {
        event_year += "," + val;
    }

    for (Integer val : monthset) {
        event_month += "," + val;
    }

    for (Integer val : dayset) {
        event_day += "," + val;
    }

    for (Integer val : hourset) {
        event_hour += "," + val;
    }

    //Strip the leading comma from each filter string
    jobinstanceid = jobinstanceid.substring(jobinstanceid.indexOf(",") + 1);
    event_year = event_year.substring(event_year.indexOf(",") + 1);
    event_month = event_month.substring(event_month.indexOf(",") + 1);
    event_day = event_day.substring(event_day.indexOf(",") + 1);
    event_hour = event_hour.substring(event_hour.indexOf(",") + 1);
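
As an aside, the concatenation loops and substring calls above can be written more compactly (a sketch assuming Java 8; the Integer sets are mapped to String before joining):

    // Compact equivalent of the concatenation loops above (Java 8).
    String jobinstanceid = String.join(",", jobinstanceidset);
    String event_year = yearset.stream()
        .map(String::valueOf)
        .collect(java.util.stream.Collectors.joining(","));
    // event_month, event_day and event_hour follow the same pattern.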


    //Data frame for ACCESS-PARENT Data, using filter obtained from collect
    DataFrame df_parent_access = sqlContext.read()
        .parquet("/data/prod/fin/access/source/stream/event_year={" + event_year + "}"
            + "/event_month={" + event_month + "}"
            + "/event_day={" + event_day + "}"
            + "/event_hour={" + event_hour + "}/*.parquet")
        .select("time", "event_datetime_EET", "md5ueidentityimsi", "ueidentityimei",
            "ueidentityimeitac", "msisdn", "md5msisdn", "locationinfotac",
            "locationinfoeci", "rattype_lu", "psn_key", "ueidentityimsi",
            "job_instance_id", "event_year", "event_month", "event_day", "event_hour");


    //Join above two dataframes to get required columns from ACCESS-PARENT
    DataFrame df_join_parent_tmp = df_parent_access.join(df_parent_base,
            df_parent_access.col("job_instance_id").equalTo(df_parent_base.col("job_instance_id"))
            .and(df_parent_access.col("event_year").equalTo(df_parent_base.col("event_year")))
            .and(df_parent_access.col("event_month").equalTo(df_parent_base.col("event_month")))
            .and(df_parent_access.col("event_day").equalTo(df_parent_base.col("event_day")))
            .and(df_parent_access.col("event_hour").equalTo(df_parent_base.col("event_hour"))),
            "inner")
        .select(df_parent_access.col("*"));

    //Repartitioning above dataframe on imsi,psn and time
    DataFrame df_join_parent = df_join_parent_tmp.repartition(
        df_join_parent_tmp.col("ueidentityimsi"),
        df_join_parent_tmp.col("psn_key"),
        df_join_parent_tmp.col("time"));


    //Data frame for CHILD-data
    DataFrame df_child_base_tmp = sqlContext.read()
        .format("com.databricks.spark.avro")
        .load("/data/prod/fin/base/source/childstream_tmp/*/*/*.avro");

    //Repartitioning on imsi,psn and time
    DataFrame df_child_base = df_child_base_tmp.repartition(
        df_child_base_tmp.col("imsi"),
        df_child_base_tmp.col("telia_psn"),
        df_child_base_tmp.col("time"));


    //Dataframe for Data Enrichment (parent, child join)
    DataFrame df_join_parent_child_tmp = df_join_parent.join(df_child_base,
            df_join_parent.col("time").equalTo(df_child_base.col("time"))
            .and(df_join_parent.col("psn_key").equalTo(df_child_base.col("telia_psn")))
            .and(df_join_parent.col("ueidentityimsi").equalTo(df_child_base.col("imsi"))),
            "inner")
        .select(
            df_join_parent.col("time"),
            df_join_parent.col("event_datetime_EET"),
            df_child_base.col("telia_psn").as("psn_key"),
            df_child_base.col("telia_pdnconnectionsmobileipaddr").as("pdnconnectionsmobileipaddr"),
            df_child_base.col("telia_pdnconnectionsmobileipv6addr").as("pdnconnectionsmobileipv6addr"),
            df_child_base.col("telia_pdnconnectionsapn").as("pdnconnectionsapn"),
            df_child_base.col("telia_pdnconnectionsapnambrdl").as("pdnconnectionsapnambrdl"),
            df_child_base.col("telia_pdnconnectionsapnambrul").as("pdnconnectionsapnambrul"),
            df_child_base.col("telia_pdnconnectionsdefaultbearerebi").as("pdnconnectionsdefaultbearerebi"),
            df_child_base.col("telia_pdnconnectionsselectionmode").as("pdnconnectionsselectionmode"),
            df_child_base.col("telia_pdnbearersebi").as("pdnbearersebi"),
            df_child_base.col("telia_pdnbearerslinkedebi").as("pdnbearerslinkedebi"),
            df_child_base.col("telia_pdnbearersbeareresttime").as("pdnbearersbeareresttime"),
            df_child_base.col("telia_pdnbearersdluserplanefteidinterface").as("pdnbearersdluserplanefteidinterface"),
            df_child_base.col("telia_pdnbearersdluserplanefteidteid").as("pdnbearersdluserplanefteidteid"),
            df_child_base.col("telia_pdnbearersdluserplanefteidip").as("pdnbearersdluserplanefteidip"),
            df_child_base.col("telia_pdnbearersuluserplanefteidinterface").as("pdnbearersuluserplanefteidinterface"),
            df_child_base.col("telia_pdnbearersuluserplanefteidteid").as("pdnbearersuluserplanefteidteid"),
            df_child_base.col("telia_pdnbearersuluserplanefteidip").as("pdnbearersuluserplanefteidip"),
            df_child_base.col("telia_pdnbearersqosqci").as("pdnbearersqosqci"),
            df_child_base.col("telia_pdnbearersprioritylevel").as("pdnbearersprioritylevel"),
            df_child_base.col("telia_pdnbearerspreemptioncapability").as("pdnbearerspreemptioncapability"),
            df_child_base.col("telia_pdnbearerspreemptionvulnerability").as("pdnbearerspreemptionvulnerability"),
            df_child_base.col("telia_pdnbearersgbrmaxbitratedl").as("pdnbearersgbrmaxbitratedl"),
            df_child_base.col("telia_pdnbearersgbrmaxbitrateul").as("pdnbearersgbrmaxbitrateul"),
            df_child_base.col("telia_pdnbearersgbrguaranteedbitratedl").as("pdnbearersgbrguaranteedbitratedl"),
            df_child_base.col("telia_pdnbearersgbrguaranteedbitrateul").as("pdnbearersgbrguaranteedbitrateul"),
            df_child_base.col("imsi"),
            df_join_parent.col("md5ueidentityimsi"),
            df_join_parent.col("ueidentityimei"),
            df_join_parent.col("ueidentityimeitac"),
            df_join_parent.col("msisdn"),
            df_join_parent.col("md5msisdn"),
            df_join_parent.col("rattype_lu"),
            df_join_parent.col("locationinfotac"),
            df_join_parent.col("locationinfoeci"),
            df_child_base.col("event_year"),
            df_child_base.col("event_month"),
            df_child_base.col("event_day"),
            df_child_base.col("event_hour"),
            df_child_base.col("ingestion_timestamp"),
            df_child_base.col("child_instance_id").as("job_instance_id"));

    //Adding a SALT column; note that lit(Numeric.random(0,100)) is evaluated
    //once on the driver, so every row ends up with the same salt value
    DataFrame df_join_parent_child = df_join_parent_child_tmp.withColumn("salt",
        org.apache.spark.sql.functions.lit(Numeric.random(0, 100)));


    //Writing parquet file
    df_join_parent_child
        .repartition(100,
            df_join_parent_child.col("salt"),
            df_join_parent_child.col("event_year"),
            df_join_parent_child.col("event_month"),
            df_join_parent_child.col("event_day"),
            df_join_parent_child.col("event_hour"))
        .drop("salt")
        .write()
        .format("parquet")
        .mode(org.apache.spark.sql.SaveMode.Append)
        .partitionBy("event_year", "event_month", "event_day", "event_hour")
        .save("/data/prod/fin/access/source/childstream/");


    System.out.println("**** DONE");

**PHYSICAL PLAN OF THE LAST STAGE:**

== Physical Plan ==
Project [time#225L,event_datetime_EET#208,telia_psn#276L AS psn_key#280L,telia_pdnconnectionsmobileipaddr#273 AS pdnconnectionsmobileipaddr#281,telia_pdnconnectionsmobileipv6addr#274 AS pdnconnectionsmobileipv6addr#282,telia_pdnconnectionsapn#269 AS pdnconnectionsapn#283,telia_pdnconnectionsapnambrdl#270L AS pdnconnectionsapnambrdl#284L,telia_pdnconnectionsapnambrul#271L AS pdnconnectionsapnambrul#285L,telia_pdnconnectionsdefaultbearerebi#272 AS pdnconnectionsdefaultbearerebi#286,telia_pdnconnectionsselectionmode#275 AS pdnconnectionsselectionmode#287,telia_pdnbearersebi#256 AS pdnbearersebi#288,telia_pdnbearerslinkedebi#261 AS pdnbearerslinkedebi#289,telia_pdnbearersbeareresttime#252L AS pdnbearersbeareresttime#290L,telia_pdnbearersdluserplanefteidinterface#253 AS pdnbearersdluserplanefteidinterface#291,telia_pdnbearersdluserplanefteidteid#255L AS pdnbearersdluserplanefteidteid#292L,telia_pdnbearersdluserplanefteidip#254 AS pdnbearersdluserplanefteidip#293,telia_pdnbearersuluserplanefteidinterface#266 AS pdnbearersuluserplanefteidinterface#294,telia_pdnbearersuluserplanefteidteid#268L AS pdnbearersuluserplanefteidteid#295L,telia_pdnbearersuluserplanefteidip#267 AS pdnbearersuluserplanefteidip#296,telia_pdnbearersqosqci#265 AS pdnbearersqosqci#297,telia_pdnbearersprioritylevel#264 AS pdnbearersprioritylevel#298,telia_pdnbearerspreemptioncapability#262 AS pdnbearerspreemptioncapability#299,telia_pdnbearerspreemptionvulnerability#263 AS pdnbearerspreemptionvulnerability#300,telia_pdnbearersgbrmaxbitratedl#259L AS pdnbearersgbrmaxbitratedl#301L,telia_pdnbearersgbrmaxbitrateul#260L AS pdnbearersgbrmaxbitrateul#302L,telia_pdnbearersgbrguaranteedbitratedl#257L AS pdnbearersgbrguaranteedbitratedl#303L,telia_pdnbearersgbrguaranteedbitrateul#258L AS pdnbearersgbrguaranteedbitrateul#304L,imsi#250L,md5ueidentityimsi#217,ueidentityimei#230L,ueidentityimeitac#231,msisdn#218L,md5msisdn#216,rattype_lu#221,locationinfotac#215,locationinfoeci#212,ingestion_timestamp#251,child_instance_id#245 AS job_instance_id#305,event_year#249,event_month#248,event_day#246,event_hour#247]
+- SortMergeJoin [time#225L,psn_key#219L,ueidentityimsi#232L], [time#277L,telia_psn#276L,imsi#250L]
   :- Sort [time#225L ASC,psn_key#219L ASC,ueidentityimsi#232L ASC], false, 0
   :  +- TungstenExchange hashpartitioning(time#225L,psn_key#219L,ueidentityimsi#232L,200), None
   :     +- Project [psn_key#219L,rattype_lu#221,ueidentityimeitac#231,ueidentityimsi#232L,msisdn#218L,locationinfoeci#212,ueidentityimei#230L,md5ueidentityimsi#217,time#225L,locationinfotac#215,md5msisdn#216,event_datetime_EET#208]
   :        +- TungstenExchange hashpartitioning(ueidentityimsi#232L,psn_key#219L,time#225L,200), None
   :           +- Project [time#225L,event_datetime_EET#208,md5ueidentityimsi#217,ueidentityimei#230L,ueidentityimeitac#231,msisdn#218L,md5msisdn#216,locationinfotac#215,locationinfoeci#212,rattype_lu#221,psn_key#219L,ueidentityimsi#232L,job_instance_id#211,event_year#241,event_month#242,event_day#243,event_hour#244]
   :              +- BroadcastHashJoin [job_instance_id#211,event_year#241,event_month#242,event_day#243,event_hour#244], [job_instance_id#54,event_year#52,event_month#51,event_day#49,event_hour#50], BuildRight
   :                 :- Scan ParquetRelation[time#225L,event_datetime_EET#208,md5ueidentityimsi#217,ueidentityimei#230L,ueidentityimeitac#231,msisdn#218L,md5msisdn#216,locationinfotac#215,locationinfoeci#212,rattype_lu#221,psn_key#219L,ueidentityimsi#232L,job_instance_id#211,event_year#241,event_month#242,event_day#243,event_hour#244] InputPaths: hdfs://nameservice1/data/prod/fin/access/source/stream/event_year=2018/event_month=8/event_day=9/event_hour=10/part-r-00000-0bb42667-10e0-4330-b9b2-af54e4cac346.snappy.parquet, hdfs://nameservice1/data/prod/fin/access/source/stream/event_year=2018/event_month=8/event_day=9/event_hour=10/part-r-00000-86f81dc9-e50f-4bd2-9fcc-210f033dd4a5.snappy.parquet 
   :                 +- InMemoryColumnarTableScan [event_day#49,event_year#52,job_instance_id#54,event_month#51,event_hour#50], InMemoryRelation [event_day#49,event_hour#50,event_month#51,event_year#52,ing_year#90,ing_month#91,ing_day#92,ing_hour#93,job_instance_id#54], true, 10000, StorageLevel(true, true, false, false, 1), TungstenAggregate(key=[event_day#49,event_hour#50,event_month#51,event_year#52,ing_year#90,ing_month#91,ing_day#92,ing_hour#93,job_instance_id#54], functions=[], output=[event_day#49,event_hour#50,event_month#51,event_year#52,ing_year#90,ing_month#91,ing_day#92,ing_hour#93,job_instance_id#54]), None
   +- Sort [time#277L ASC,telia_psn#276L ASC,imsi#250L ASC], false, 0
      +- TungstenExchange hashpartitioning(time#277L,telia_psn#276L,imsi#250L,200), None
         +- Project [event_hour#247,telia_pdnbearersprioritylevel#264,telia_pdnconnectionsmobileipaddr#273,telia_pdnconnectionsapn#269,telia_pdnbearersdluserplanefteidinterface#253,telia_pdnbearersgbrmaxbitrateul#260L,time#277L,telia_pdnconnectionsmobileipv6addr#274,telia_pdnbearerspreemptioncapability#262,telia_pdnbearersebi#256,telia_pdnbearerspreemptionvulnerability#263,event_year#249,telia_pdnbearersgbrmaxbitratedl#259L,telia_pdnbearersbeareresttime#252L,telia_pdnbearersuluserplanefteidinterface#266,telia_pdnbearersgbrguaranteedbitrateul#258L,telia_pdnbearersuluserplanefteidip#267,telia_pdnconnectionsselectionmode#275,imsi#250L,telia_psn#276L,event_day#246,telia_pdnconnectionsdefaultbearerebi#272,telia_pdnbearerslinkedebi#261,event_month#248,telia_pdnbearersqosqci#265,ingestion_timestamp#251,telia_pdnbearersdluserplanefteidteid#255L,telia_pdnbearersgbrguaranteedbitratedl#257L,telia_pdnbearersuluserplanefteidteid#268L,telia_pdnconnectionsapnambrdl#270L,telia_pdnbearersdluserplanefteidip#254,telia_pdnconnectionsapnambrul#271L,child_instance_id#245]
            +- TungstenExchange hashpartitioning(imsi#250L,telia_psn#276L,time#277L,200), None
               +- Scan AvroRelation[child_instance_id#245,event_day#246,event_hour#247,event_month#248,event_year#249,imsi#250L,ingestion_timestamp#251,telia_pdnbearersbeareresttime#252L,telia_pdnbearersdluserplanefteidinterface#253,telia_pdnbearersdluserplanefteidip#254,telia_pdnbearersdluserplanefteidteid#255L,telia_pdnbearersebi#256,telia_pdnbearersgbrguaranteedbitratedl#257L,telia_pdnbearersgbrguaranteedbitrateul#258L,telia_pdnbearersgbrmaxbitratedl#259L,telia_pdnbearersgbrmaxbitrateul#260L,telia_pdnbearerslinkedebi#261,telia_pdnbearerspreemptioncapability#262,telia_pdnbearerspreemptionvulnerability#263,telia_pdnbearersprioritylevel#264,telia_pdnbearersqosqci#265,telia_pdnbearersuluserplanefteidinterface#266,telia_pdnbearersuluserplanefteidip#267,telia_pdnbearersuluserplanefteidteid#268L,telia_pdnconnectionsapn#269,telia_pdnconnectionsapnambrdl#270L,telia_pdnconnectionsapnambrul#271L,telia_pdnconnectionsdefaultbearerebi#272,telia_pdnconnectionsmobileipaddr#273,telia_pdnconnectionsmobileipv6addr#274,telia_pdnconnectionsselectionmode#275,telia_psn#276L,time#277L,ing_time#278L,job_instance_id#279] InputPaths: hdfs://nameservice-cdlpv2/data/prod/fin/base/source/childstream_tmp/ing_time=20180808200003/job_instance_id=20180808_200801/part-r-00000-eb25e03c-2f08-4614-8107-c85d20d8fc15.avro, hdfs://nameservice-cdlpv2/data/prod/fin/base/source/childstream_tmp/ing_time=20180808200003/job_instance_id=20180808_200801/part-r-00001-eb25e03c-2f08-4614-8107-c85d20d8fc15.avro

**The relevant screenshots are attached:**
[DAG][1]
[Single Task running continuously - increasing shuffle write ][2]
[Query plan][3]


  [1]: https://i.stack.imgur.com/WizUn.jpg
  [2]: https://i.stack.imgur.com/t0XHA.jpg
  [3]: https://i.stack.imgur.com/TWrOO.jpg

**UPDATE:** I made some changes; the SortMergeJoin still keeps running, continuously adding more rows very slowly, and it never seems to finish that one task. I am attaching the query plan with the growing counter highlighted. Where am I going wrong? Only the value highlighted in red keeps increasing; no other row/value or data size changes.

SortMergeJoin keeps increasing the record count
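
One way to check whether a few hot join keys are feeding that single never-ending task is to profile both sides of the SortMergeJoin (a diagnostic sketch against the dataframes defined above):

    import static org.apache.spark.sql.functions.col;

    // Count rows per join key on each side; a handful of very large counts
    // would explain one task shuffling far more data than the rest.
    df_join_parent.groupBy("ueidentityimsi", "psn_key", "time")
        .count()
        .orderBy(col("count").desc())
        .show(20);

    df_child_base.groupBy("imsi", "telia_psn", "time")
        .count()
        .orderBy(col("count").desc())
        .show(20);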

0 Answers