Spark join using a broadcast variable slows down over iterations

Date: 2018-09-25 03:39:33

Tags: apache-spark pyspark apache-spark-sql spark-java

I am using a join to combine two datasets. One dataset is large and the other is small; when I joined them with a regular join, the shuffle took too much time, and after switching to a broadcast join the improvement was still not significant.
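
The join code itself is not shown in the question; the following PySpark sketch is reconstructed from the physical plans below (the DataFrame names, the HDFS path, and the read options are assumptions; the ticker/buyerId/maxdate renames mirror the column names visible on the broadcast side of the plans):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col, max as max_

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Large side: the raw CSV scan that appears in the plans (path is an assumption).
df_trades = spark.read.csv("hdfs://test.test.csv", header=True, inferSchema=True)

# Small side: latest datetime per (symbol, traderId), matching the
# HashAggregate(max(datetime)) stage; the renames avoid self-join ambiguity
# and match the ticker/buyerId/maxdate names in the plans.
df_maxdates = (df_trades
    .groupBy("symbol", "traderId")
    .agg(max_("datetime").alias("maxdate"))
    .withColumnRenamed("symbol", "ticker")
    .withColumnRenamed("traderId", "buyerId"))

# broadcast() hints Spark to ship the small aggregate to every executor, so
# the large side is joined in place (BroadcastHashJoin) instead of shuffled.
joined = df_trades.join(
    broadcast(df_maxdates),
    (col("symbol") == col("ticker"))
    & (col("traderId") == col("buyerId"))
    & (col("datetime") == col("maxdate")),
    "inner")
joined.explain()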

Physical plan before applying the broadcast join:


After applying the broadcast join:

== Physical Plan ==
*BroadcastHashJoin [symbol#497, traderId#498, datetime#499], [ticker#523, buyerId#528, maxdate#518], Inner, BuildRight
:- *Project [symbol#497, traderId#498, datetime#499, buyorderqty#500, sellorderqty#501, position#502]
:  +- *Filter ((isnotnull(datetime#499) && isnotnull(symbol#497)) && isnotnull(traderId#498))
:     +- *FileScan csv [symbol#497,traderId#498,datetime#499,buyorderqty#500,sellorderqty#501,position#502] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://test.test.csv], PartitionFilters: [], PushedFilters: [IsNotNull(datetime), IsNotNull(symbol), IsNotNull(traderId)], ReadSchema: struct<symbol:string,traderId:string,datetime:timestamp,buyorderqty:int,sellorderqty:int,position...
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], input[1, string, true], input[2, timestamp, false]))
   +- *Filter isnotnull(maxdate#518)
      +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[max(datetime#499)])
         +- Exchange hashpartitioning(symbol#497, traderId#498, 200)
            +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[partial_max(datetime#499)])
               +- *Project [symbol#497, traderId#498, datetime#499]
                  +- *Filter (isnotnull(traderId#498) && isnotnull(symbol#497))
                     +- *FileScan csv [symbol#497,traderId#498,datetime#499] Batched: false, Format: CSV, Location: InMemoryFileIndex

I joined the two datasets like this, which produced the following physical plan:

== Physical Plan ==
*Project [maxdate#518, symbol#497, traderId#498, position#502]
+- *BroadcastHashJoin [ticker#523, buyerId#528, maxdate#518], [symbol#497, traderId#498, datetime#499], Inner, BuildRight
   :- *Filter isnotnull(maxdate#518)
   :  +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[max(datetime#499)])
   :     +- Exchange hashpartitioning(symbol#497, traderId#498, 200)
   :        +- *HashAggregate(keys=[symbol#497, traderId#498], functions=[partial_max(datetime#499)])
   :           +- *Project [symbol#497, traderId#498, datetime#499]
   :              +- *Filter (isnotnull(traderId#498) && isnotnull(symbol#497))
   :                 +- *FileScan csv [symbol#497,traderId#498,datetime#499] Batched: false, Format: CSV, Location: InMemoryFileIndex[ PartitionFilters: [], PushedFilters: [IsNotNull(traderId), IsNotNull(symbol)], ReadSchema: struct<symbol:string,traderId:string,datetime:timestamp>
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true], input[1, string, true], input[2, timestamp, true]))
      +- *Project [symbol#497, traderId#498, datetime#499, position#502]
         +- *Filter ((isnotnull(datetime#499) && isnotnull(symbol#497)) && isnotnull(traderId#498))
            +- *FileScan csv [symbol#497,traderId#498,datetime#499,position#502] Batched: false, Format: CSV, Location: InMemoryFileIndex[hdfs://., PartitionFilters: []
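
For comparison, a sketch of a join order that would produce a plan shaped like the one above, where the build (broadcast) side is the projected CSV columns rather than the aggregate; it reuses the df_trades / df_maxdates names assumed in the earlier sketch:

from pyspark.sql.functions import broadcast, col

# Continues the sketch above (df_trades / df_maxdates are assumptions).
# Here the aggregate streams and the projected CSV side is broadcast,
# matching BroadcastExchange over Project [..., position#502] in this plan.
joined2 = (df_maxdates
    .join(broadcast(df_trades.select("symbol", "traderId", "datetime", "position")),
          (col("ticker") == col("symbol"))
          & (col("buyerId") == col("traderId"))
          & (col("maxdate") == col("datetime")),
          "inner")
    .select("maxdate", "symbol", "traderId", "position"))  # matches the top Project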

I am running this for a number of iterations, and it gets slower as the iteration count grows. Can someone suggest ways to improve the performance? And even when both tables are small, how should I join them?
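
Two standard levers are relevant to these symptoms, sketched below under explicit assumptions: if each iteration builds on the previous result, the logical plan grows and planning gets slower, which checkpointing truncates; and Spark broadcasts either side of a join automatically when it is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), which also covers the small-table/small-table case. The loop body, paths, and values are illustrative, not the asker's code:

# Illustrative, not the asker's code: checkpointing materializes the result
# and cuts the lineage, so per-iteration planning cost stays flat.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/chk")  # directory is an assumption

# Let Spark auto-broadcast any join side under 10 MB (covers two small tables).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

df = df_trades
for i in range(20):  # iteration count is illustrative
    df = (df.join(broadcast(df_maxdates),
                  (col("symbol") == col("ticker"))
                  & (col("traderId") == col("buyerId"))
                  & (col("datetime") == col("maxdate")),
                  "inner")
            .drop("ticker", "buyerId", "maxdate"))
    df = df.checkpoint(eager=True)  # truncate lineage each iteration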

0 Answers:

No answers.