Spark SQL - Does renaming a column affect partitioning?

Time: 2018-08-20 09:47:44

Tags: apache-spark apache-spark-sql

I wrote an explicitJoin API that renames the columns of a dataset with an l_ or r_ prefix, both to disambiguate them and to work around Spark lineage/resolution errors such as: columnName1#77 not found in columnName1#123, columnName2#55 ...

Some of the code looks like this:

    def explicitJoin(other: Dataset[_], joinExpr: Column, joinType: String): ExplicitJoinExt = {
      val left = dataset.toDF(dataset.columns.map("l_" + _): _*)
      val right = other.toDF(other.columns.map("r_" + _): _*)

      new ExplicitJoinExt(left.join(right, joinExpr, joinType))
    }

Users can then pass a join expression such as $"l_columnName1" === $"r_columnName1" && ..., so that they are 100% explicit about which columns they are joining on.
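The prefixing itself is plain string manipulation over the column names. A minimal standalone sketch of that scheme (no Spark needed; the helper names here are hypothetical, not part of the actual API):

```scala
object PrefixDemo {
  // Mirrors the l_/r_ renaming in explicitJoin: prefix every column name.
  def prefixAll(prefix: String, cols: Seq[String]): Seq[String] =
    cols.map(prefix + _)

  // The inverse that a selectLeft-style helper would need: keep one side's
  // columns and strip the prefix again.
  def unprefix(prefix: String, cols: Seq[String]): Seq[String] =
    cols.filter(_.startsWith(prefix)).map(_.stripPrefix(prefix))

  def main(args: Array[String]): Unit = {
    val joined = prefixAll("l_", Seq("a", "b")) ++ prefixAll("r_", Seq("a", "b"))
    println(joined.mkString(","))                 // l_a,l_b,r_a,r_b
    println(unprefix("l_", joined).mkString(",")) // a,b
  }
}
```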

I have now hit a new problem where a partition is too large to be loaded into memory (org.apache.spark.shuffle.FetchFailedException: Too large frame ...), even though reading the input (partitioned) datasets is not a problem.

Does renaming columns affect the underlying partitioning of the input Datasets/DataFrames?

Edit

Example 1 - a regular join

    case class A(a: Int, b: String)

    val l = (0 to 1000000).map(i => A(i, i.toString))
    val r = (0 to 1000000).map(i => A(i, i.toString))

    val ds1 = l.toDF.as[A].repartition(100, $"a")
    val ds2 = r.toDF.as[A].repartition(100, $"a")

    val joined = ds1.join(ds2, Seq("a"), "inner")

    joined.explain

    == Physical Plan ==
    *Project [a#2, b#3, b#15]
    +- *SortMergeJoin [a#2], [a#14], Inner
       :- *Sort [a#2 ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(a#2, 100)
       :     +- LocalTableScan [a#2, b#3]
       +- *Sort [a#14 ASC NULLS FIRST], false, 0
          +- ReusedExchange [a#14, b#15], Exchange hashpartitioning(a#2, 100)

Example 2 - using my (possibly misguided) ExplicitJoinExt, which involves the renaming

    val joined = ds1
      .explicitJoin(ds2, $"l_a" === $"r_a", "inner") // Pimped on conversion to ExplicitJoin type, columns prefixed by l_ or r_. DS joined by expr and join type
      .selectLeft                                    // Select just left prefixed columns
      .toDF                                          // Convert back from ExplicitJoinExpr to DF
      .as[A]

    joined.explain


    == Physical Plan ==
    *Project [l_a#24 AS a#53, l_b#25 AS b#54]
    +- *BroadcastHashJoin [l_a#24], [r_a#29], Inner, BuildRight
       :- *Project [a#2 AS l_a#24, b#3 AS l_b#25]
       :  +- Exchange hashpartitioning(a#2, 100)
       :     +- LocalTableScan [a#2, b#3]
       +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
          +- *Project [a#14 AS r_a#29]
             +- Exchange hashpartitioning(a#14, 100)
                +- LocalTableScan [a#14]

So for the second join it looks like we repartition - is that right?
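(A side note on comparing the two plans: the second one is a BroadcastHashJoin rather than a SortMergeJoin, likely because the projected-down right-hand side fell under the broadcast threshold, so the plans are not directly comparable. For a like-for-like comparison, auto-broadcast can be disabled first; this is a standard Spark setting, nothing specific to ExplicitJoinExt.)

```scala
// Disable broadcast joins so both variants plan a SortMergeJoin,
// making the exchanges in the two physical plans directly comparable.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```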

2 Answers:

Answer 0 (score: 0)

No, I checked this on Spark 2.3.1: renaming does not affect the underlying partitioning.


The EXPLAIN output for:

    val ds11 = ds1.repartition(4)

shows, in my cases, 4 and 2 as the number of partitions respectively, the second case being:

    val ds11 = ds1.repartition(2, $"cityid")

and one can see that the renamed columns are mapped back to their original names in the plan.

In tests for a post elsewhere, we were able to establish that new operations relying on aggregations or joins, e.g.

    val j = left.join(right, $"l_personid" === $"r_personid", "inner").explain

will default to 200 shuffle partitions unless the relevant configuration is issued in code and set to the desired value. The results can differ if only a small subset of the data is being joined.
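The 200-partition default mentioned above comes from Spark's shuffle-partitions setting, which can be overridden in code (standard Spark configuration):

```scala
// spark.sql.shuffle.partitions controls how many partitions Spark uses when
// shuffling data for joins and aggregations; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```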

Answer 1 (score: 0)

For anyone still running into this: renaming a column does affect partitioning in Spark < 3.0.

Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()

gives the following plan:

== Physical Plan ==
Exchange hashpartitioning(c#40, 10)
+- *(1) Project [a#36, b#37 AS c#40]
   +- Exchange hashpartitioning(b#37, 10)
      +- LocalTableScan [a#36, b#37]

This was fixed in this PR.