I am using a join to combine two datasets: 1. When one dataset is large and the other is small, I joined them with a plain join, but the shuffle took far too long; after switching to a broadcast join, I still did not see much improvement.
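For reference, this is roughly how the broadcast hint is applied; a minimal Scala sketch, assuming two CSV inputs with the join-key column names that appear in the plans below (the HDFS paths are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()

    // Placeholder paths; substitute the real HDFS locations.
    val largeDf = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///data/trades_large.csv")
    val smallDf = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("hdfs:///data/trades_small.csv")

    // broadcast() asks Spark to ship the small side to every executor,
    // replacing the shuffle-based SortMergeJoin with a BroadcastHashJoin.
    val joined = largeDf.join(broadcast(smallDf), Seq("symbol", "traderId", "datetime"))

    joined.show()
    spark.stop()
  }
}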
Physical plan before applying the broadcast join:
After applying the broadcast join:
== Physical Plan ==
*BroadcastHashJoin [symbol#497, traderId#498, datetime#499], [ticker#523, buyerId#528, maxdate#518], Inner, BuildRight
:- *Project [symbol#497, traderId#498, datetime#499, buyorderqty#500, sellorderqty#501, position#502]
:  +- *Filter ((isnotnull(datetime#499) && isnotnull(symbol#497)) && isnotnull(traderId#498))
:     +- *FileScan csv [symbol#497,traderId#498,datetime#499,buyorderqty#500,sellorderqty#501,position#502] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://<hdfs://test.test.csv>, PartitionFilters: [], PushedFilters: [IsNotNull(datetime), IsNotNull(symbol), IsNotNull(traderId)], ReadSchema: struct<symbol:string,traderId:string,datetime:timestamp,buyorderqty:int,sellorderqty:int,position...
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], input[1, string, true], input[2, timestamp, false]))
   +- *Filter isnotnull(maxdate#518)
      +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[max(datetime#499)])
         +- Exchange hashpartitioning(symbol#497, traderId#498, 200)
            +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[partial_max(datetime#499)])
               +- *Project [symbol#497, traderId#498, datetime#499]
                  +- *Filter (isnotnull(traderId#498) && isnotnull(symbol#497))
                     +- *FileScan csv [symbol#497,traderId#498,datetime#499] Batched: false, Format: CSV, Location: InMemoryFileIndex
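Reading this plan: the broadcast (build) side is itself an aggregation of the input, so the Exchange hashpartitioning(symbol#497, traderId#498, 200) under the HashAggregate is the shuffle for the max() aggregation, which the broadcast join does not remove. Below is a sketch of the query shape this plan implies, assuming a single DataFrame named trades with the columns shown above; the renames to ticker/buyerId/maxdate match the join keys in the plan:

import org.apache.spark.sql.functions.{broadcast, col, max}

// Build side: latest datetime per (symbol, traderId), renamed so the
// join keys line up with [ticker, buyerId, maxdate] from the plan.
val latest = trades
  .groupBy("symbol", "traderId")
  .agg(max("datetime").as("maxdate"))
  .withColumnRenamed("symbol", "ticker")
  .withColumnRenamed("traderId", "buyerId")

// Probe side: join each row back against its group's max timestamp.
val latestPositions = trades.join(
  broadcast(latest),
  col("symbol") === col("ticker") &&
    col("traderId") === col("buyerId") &&
    col("datetime") === col("maxdate"))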
This is how I joined the two datasets:
== Physical Plan ==
*Project [maxdate#518, symbol#497, traderId#498, position#502]
+- *BroadcastHashJoin [ticker#523, buyerId#528, maxdate#518], [symbol#497, traderId#498, datetime#499], Inner, BuildRight
:- *Filter isnotnull(maxdate#518)
:  +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[max(datetime#499)])
:     +- Exchange hashpartitioning(symbol#497, traderId#498, 200)
:        +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[partial_max(datetime#499)])
:           +- *Project [symbol#497, traderId#498, datetime#499]
:              +- *Filter (isnotnull(traderId#498) && isnotnull(symbol#497))
:                 +- *FileScan csv [symbol#497,traderId#498,datetime#499] Batched: false, Format: CSV, Location: InMemoryFileIndex[ PartitionFilters: [], PushedFilters: [IsNotNull(traderId), IsNotNull(symbol)], ReadSchema: struct<symbol:string,traderId:string,datetime:timestamp>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], input[1, string, true], input[2, timestamp, true]))
   +- *Project [symbol#497, traderId#498, datetime#499, position#502]
      +- *Filter ((isnotnull(datetime#499) && isnotnull(symbol#497)) && isnotnull(traderId#498))
         +- *FileScan csv [symbol#497,traderId#498,datetime#499,position#502] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://., PartitionFilters: []
I am running this job for many iterations, and it keeps getting slower as the iteration count grows. Can anyone suggest ways to improve performance? Also, how should the join be done when both tables are small?
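On the per-iteration slowdown: one common cause is that the lineage (and therefore the physical plan) grows with every iteration, so each pass has more to plan and recompute. A minimal sketch of truncating lineage with checkpoint(); here spark, initialDf, step, and numIterations are hypothetical stand-ins for the actual session, input, and loop body:

// Assumes an active SparkSession `spark`; the checkpoint dir is a placeholder.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

var df = initialDf                // hypothetical starting DataFrame
for (i <- 1 to numIterations) {
  df = step(df)                   // hypothetical per-iteration transformation
  if (i % 10 == 0) {              // checkpoint interval is arbitrary
    df = df.checkpoint()          // materializes the data and cuts the lineage
  }
}

As for two small tables: Spark broadcasts a side automatically when its estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), so a plain join between two genuinely small inputs typically already becomes a BroadcastHashJoin; caching both tables before the loop can also help.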