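I wrote an explicitJoin API that renames the columns of each dataset with an l_ or r_ prefix, both to disambiguate them and to work around Spark lineage problems such as columnName1#77 not being found in columnName1#123, columnName2#55 ...
Part of the code is shown below: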
def explicitJoin(other: Dataset[_], joinExpr: Column, joinType: String): ExplicitJoinExt = {
  val left = dataset.toDF(dataset.columns.map("l_" + _): _*)
  val right = other.toDF(other.columns.map("r_" + _): _*)
  new ExplicitJoinExt(left.join(right, joinExpr, joinType))
}
Users can then pass a join expression such as $"l_columnName1" === $"r_columnName1" && ..., so that they are 100% explicit about which columns they are joining on.
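For readers who want to see how the pieces could fit together, here is a minimal sketch of the wiring. Only explicitJoin, ExplicitJoinExt, selectLeft and toDF appear in the question; the object name ExplicitJoinSyntax, the implicit class ExplicitJoinOps, the helper prefixed and the body of selectLeft are assumptions made for illustration, not the author's actual implementation.

```scala
import org.apache.spark.sql.{Column, DataFrame, Dataset}

object ExplicitJoinSyntax {

  // Pure helper (hypothetical): put the same prefix in front of every column name.
  def prefixed(columns: Seq[String], prefix: String): Seq[String] =
    columns.map(prefix + _)

  // Hypothetical wrapper around the joined, prefixed DataFrame.
  final class ExplicitJoinExt(joined: DataFrame) {

    // Assumed behaviour: keep only the l_ columns and strip the prefix again.
    def selectLeft: ExplicitJoinExt = {
      val leftCols = joined.columns.toSeq
        .filter(_.startsWith("l_"))
        .map(name => joined.col(name).as(name.stripPrefix("l_")))
      new ExplicitJoinExt(joined.select(leftCols: _*))
    }

    // Unwrap back to a plain DataFrame.
    def toDF: DataFrame = joined
  }

  // Adds explicitJoin to any Dataset once ExplicitJoinSyntax._ is imported.
  implicit class ExplicitJoinOps(dataset: Dataset[_]) {
    def explicitJoin(other: Dataset[_], joinExpr: Column, joinType: String): ExplicitJoinExt = {
      val left = dataset.toDF(prefixed(dataset.columns.toSeq, "l_"): _*)
      val right = other.toDF(prefixed(other.columns.toSeq, "r_"): _*)
      new ExplicitJoinExt(left.join(right, joinExpr, joinType))
    }
  }
}
```

Both renames and selectLeft are plain column projections, which is why they show up as Project nodes in a physical plan.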
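I have now hit a new problem where partitions are too large to load into memory (org.apache.spark.shuffle.FetchFailedException: Too large frame ...), even though reading the input (partitioned) datasets works without problems.
Does renaming columns affect the underlying partitioning of the input datasets/DataFrames?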
EDIT
Example 1 - regular join
case class A(a: Int, b: String)
val l = (0 to 1000000).map(i => A(i, i.toString))
val r = (0 to 1000000).map(i => A(i, i.toString))
val ds1 = l.toDF.as[A].repartition(100, $"a")
val ds2 = r.toDF.as[A].repartition(100, $"a")
val joined = ds1.join(ds2, Seq("a"), "inner")
joined.explain
== Physical Plan ==
*Project [a#2, b#3, b#15]
+- *SortMergeJoin [a#2], [a#14], Inner
   :- *Sort [a#2 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(a#2, 100)
   :     +- LocalTableScan [a#2, b#3]
   +- *Sort [a#14 ASC NULLS FIRST], false, 0
      +- ReusedExchange [a#14, b#15], Exchange hashpartitioning(a#2, 100)
Example 2 - using my (possibly misguided) ExplicitJoinExt, which involves renaming
val joined = ds1
  .explicitJoin(ds2, $"l_a" === $"r_a", "inner") // Implicitly converted to the ExplicitJoinExt type: columns prefixed with l_ or r_, datasets joined by expression and join type
  .selectLeft // Select just the left-prefixed columns
  .toDF // Convert back from ExplicitJoinExt to a DataFrame
  .as[A]
joined.explain
== Physical Plan ==
*Project [l_a#24 AS a#53, l_b#25 AS b#54]
+- *BroadcastHashJoin [l_a#24], [r_a#29], Inner, BuildRight
   :- *Project [a#2 AS l_a#24, b#3 AS l_b#25]
   :  +- Exchange hashpartitioning(a#2, 100)
   :     +- LocalTableScan [a#2, b#3]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- *Project [a#14 AS r_a#29]
         +- Exchange hashpartitioning(a#14, 100)
            +- LocalTableScan [a#14]
So for the second join it looks like we repartition again. Is that correct?
Answer 0 (score: 0)
No, I checked with Spark 2.3.1. Renaming does not affect partitioning, at least not with this approach:
val ds11 = ds1.repartition(4)
I checked with this as well. Renaming does not affect partitioning here either:
val ds11 = ds1.repartition(2, $"cityid")
The EXPLAIN output for:
val j = left.join(right, $"l_personid" === $"r_personid", "inner").explain
shows, in my case, 2 and 4 as the number of partitions, and you can see that the renamed columns are mapped back to their original names.
In testing for a post elsewhere, we established that new operations that rely on aggregations or joins default to 200 partitions unless the shuffle partition setting (spark.sql.shuffle.partitions) is set in the code to the required value. If only a small amount of data is being joined, the results may differ.
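A minimal sketch of how such a check could be reproduced locally. The column names personid and cityid come from the snippets in this answer; the sample rows, the object name RenameKeepsPartitions, the helper numPartitions, the local master and the value "8" are assumptions for illustration only. It compares the partition count before and after a prefix rename, then sets spark.sql.shuffle.partitions, the setting that controls how many partitions joins and aggregations shuffle into.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameKeepsPartitions {

  // Pure helper (hypothetical): the prefixed column names used for the rename.
  def withPrefix(columns: Seq[String], prefix: String): Seq[String] =
    columns.map(prefix + _)

  // Number of partitions of the RDD backing a DataFrame.
  def numPartitions(df: DataFrame): Int = df.rdd.getNumPartitions

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: a local test session
      .appName("rename-partition-check")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; only the column names come from the answer.
    val ds1 = Seq((1, 10), (2, 20), (3, 10)).toDF("personid", "cityid")
    val ds11 = ds1.repartition(2, $"cityid")

    // Rename every column with an l_ prefix, as explicitJoin does.
    val renamed = ds11.toDF(withPrefix(ds11.columns.toSeq, "l_"): _*)

    println(s"before rename: ${numPartitions(ds11)}, after rename: ${numPartitions(renamed)}")
    renamed.explain()

    // Shuffles introduced by later joins/aggregations use this setting
    // instead of the partition count chosen by repartition above.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    spark.stop()
  }
}
```

Comparing the two counts and reading the explain output side by side is usually enough to see whether a rename introduced an extra Exchange.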
Answer 1 (score: 0)
For anyone still running into this: renaming columns does affect partitioning in Spark < 3.0.
Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()
This gives the following plan:
== Physical Plan ==
Exchange hashpartitioning(c#40, 10)
+- *(1) Project [a#36, b#37 AS c#40]
   +- Exchange hashpartitioning(b#37, 10)
      +- LocalTableScan [a#36, b#37]
This issue was fixed in this PR.
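To check which behaviour a given Spark version shows, one option is to count the hash exchanges in the physical plan text. The sketch below is a rough illustration: the object name ExchangeCount, the helper countHashExchanges and the local master are assumptions, and plain string matching is a crude stand-in for inspecting the plan tree (it also counts a ReusedExchange line that mentions the exchange it reuses).

```scala
import org.apache.spark.sql.SparkSession

object ExchangeCount {

  // Count occurrences of "Exchange hashpartitioning" in a plan's text form.
  def countHashExchanges(planText: String): Int =
    "Exchange hashpartitioning".r.findAllIn(planText).size

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: a local test session
      .appName("exchange-count")
      .getOrCreate()
    import spark.implicits._

    // Same pipeline as in the answer: repartition, rename, repartition again.
    val df = Seq((1, 2))
      .toDF("a", "b")
      .repartition($"b")
      .withColumnRenamed("b", "c")
      .repartition($"c")

    val planText = df.queryExecution.executedPlan.toString
    println(planText)
    println(s"hash exchanges in plan: ${countHashExchanges(planText)}")

    spark.stop()
  }
}
```

Applied to the plan quoted above, the helper finds two hash exchanges, one per repartition call.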