Spark SQL - Does renaming a column affect partitioning?

Time: 2018-08-20 09:47:44

Tags: apache-spark apache-spark-sql

I wrote an explicitJoin API that renames the columns of a dataset with an l_ or r_ prefix, both to disambiguate them and to work around Spark lineage/resolution errors such as: columnName1#77 not found in columnName1#123, columnName2#55 ...

Some of the code looks like this:

    def explicitJoin(other: Dataset[_], joinExpr: Column, joinType: String): ExplicitJoinExt = {
      val left = dataset.toDF(dataset.columns.map("l_" + _): _*)
      val right = other.toDF(other.columns.map("r_" + _): _*)

      new ExplicitJoinExt(left.join(right, joinExpr, joinType))
    }

Users can then pass a join expression such as $"l_columnName1" === $"r_columnName1" && ..., so that they are 100% explicit about which columns they are joining on.
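The prefixing itself is plain string manipulation over the column names. A minimal standalone sketch of that scheme (no Spark needed; the helper names here are hypothetical, not part of the actual API):

```scala
object PrefixDemo {
  // Mirrors the l_/r_ renaming in explicitJoin: prefix every column name.
  def prefixAll(prefix: String, cols: Seq[String]): Seq[String] =
    cols.map(prefix + _)

  // The inverse that a selectLeft-style helper would need: keep one side's
  // columns and strip the prefix again.
  def unprefix(prefix: String, cols: Seq[String]): Seq[String] =
    cols.filter(_.startsWith(prefix)).map(_.stripPrefix(prefix))

  def main(args: Array[String]): Unit = {
    val joined = prefixAll("l_", Seq("a", "b")) ++ prefixAll("r_", Seq("a", "b"))
    println(joined.mkString(","))                 // l_a,l_b,r_a,r_b
    println(unprefix("l_", joined).mkString(",")) // a,b
  }
}
```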

I have now hit a new problem where a partition is too large to be loaded into memory (org.apache.spark.shuffle.FetchFailedException: Too large frame ...), even though reading the input (partitioned) datasets is not a problem.

Does renaming columns affect the underlying partitioning of the input Datasets/DataFrames?

Edit

Example 1 - a regular join

    case class A(a: Int, b: String)

    val l = (0 to 1000000).map(i => A(i, i.toString))
    val r = (0 to 1000000).map(i => A(i, i.toString))

    val ds1 = l.toDF.as[A].repartition(100, $"a")
    val ds2 = r.toDF.as[A].repartition(100, $"a")

    val joined = ds1.join(ds2, Seq("a"), "inner")

    joined.explain

    == Physical Plan ==
    *Project [a#2, b#3, b#15]
    +- *SortMergeJoin [a#2], [a#14], Inner
       :- *Sort [a#2 ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(a#2, 100)
       :     +- LocalTableScan [a#2, b#3]
       +- *Sort [a#14 ASC NULLS FIRST], false, 0
          +- ReusedExchange [a#14, b#15], Exchange hashpartitioning(a#2, 100)

Example 2 - using my (possibly misguided) ExplicitJoinExt, which involves the renaming

    val joined = ds1
      .explicitJoin(ds2, $"l_a" === $"r_a", "inner") // Pimped on conversion to ExplicitJoin type, columns prefixed by l_ or r_. DS joined by expr and join type
      .selectLeft                                    // Select just left prefixed columns
      .toDF                                          // Convert back from ExplicitJoinExpr to DF
      .as[A]

    joined.explain


    == Physical Plan ==
    *Project [l_a#24 AS a#53, l_b#25 AS b#54]
    +- *BroadcastHashJoin [l_a#24], [r_a#29], Inner, BuildRight
       :- *Project [a#2 AS l_a#24, b#3 AS l_b#25]
       :  +- Exchange hashpartitioning(a#2, 100)
       :     +- LocalTableScan [a#2, b#3]
       +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
          +- *Project [a#14 AS r_a#29]
             +- Exchange hashpartitioning(a#14, 100)
                +- LocalTableScan [a#14]

So for the second join it looks like we repartition - is that right?
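(A side note on comparing the two plans: the second one is a BroadcastHashJoin rather than a SortMergeJoin, likely because the projected-down right-hand side fell under the broadcast threshold, so the plans are not directly comparable. For a like-for-like comparison, auto-broadcast can be disabled first; this is a standard Spark setting, nothing specific to ExplicitJoinExt.)

```scala
// Disable broadcast joins so both variants plan a SortMergeJoin,
// making the exchanges in the two physical plans directly comparable.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```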

2 Answers:

Answer 0 (score: 0)

No, I checked this on Spark 2.3.1: renaming does not affect the underlying partitioning.


The EXPLAIN output for:

    val ds11 = ds1.repartition(4)

shows, in my cases, 4 and 2 as the number of partitions respectively, the second case being:

    val ds11 = ds1.repartition(2, $"cityid")

and one can see that the renamed columns are mapped back to their original names in the plan.

In tests for a post elsewhere, we were able to establish that new operations relying on aggregations or joins, e.g.

    val j = left.join(right, $"l_personid" === $"r_personid", "inner").explain

will default to 200 shuffle partitions unless the relevant configuration is issued in code and set to the desired value. The results can differ if only a small subset of the data is being joined.
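The 200-partition default mentioned above comes from Spark's shuffle-partitions setting, which can be overridden in code (standard Spark configuration):

```scala
// spark.sql.shuffle.partitions controls how many partitions Spark uses when
// shuffling data for joins and aggregations; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```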

Answer 1 (score: 0)

For anyone still running into this: renaming a column does affect partitioning in Spark < 3.0.

Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()

gives the following plan:

== Physical Plan ==
Exchange hashpartitioning(c#40, 10)
+- *(1) Project [a#36, b#37 AS c#40]
   +- Exchange hashpartitioning(b#37, 10)
      +- LocalTableScan [a#36, b#37]

This was fixed in this PR.