I have a Spark cluster and HDFS running on the same machines. I copied a text file of about 3 GB onto the local filesystem of each machine and onto the HDFS distributed filesystem.
I have a simple word-count PySpark program.
If I submit the program reading the file from the local filesystem, it takes about 33 seconds. If I submit the program reading the file from HDFS, it takes about 46 seconds.
Why? I expected exactly the opposite result.

Added after sgvd's request:
16 slaves, 1 master
Spark Standalone with no particular settings (replication factor 3)
Version 1.5.2
import sys
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
import os
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['JAVA_HOME'] = '/usr/local/java'

from pyspark import SparkContext
# conf = pyspark.SparkConf().set<conf settings>

if sys.argv[1] == 'local':
    print 'Running in local file mode'
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    print 'Running in HDFS mode'
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')

# Word count: split on spaces, count each word, then take the five most frequent.
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print topFive
Answer 0 (score: 1)
This is somewhat counter-intuitive, but because the replication factor is 3 and you have 16 nodes, each node stores on average about 20% of the data locally in HDFS. About 6 worker nodes should therefore be enough, on average, to read the whole file without any network transfer.

If you record running time against the number of worker nodes, you should notice that beyond roughly 6 there is no difference between reading from the local FS and reading from HDFS.

The calculation above can be done with variables, say x = number of worker nodes and y = replication factor. Because reading from the local FS forces the file to be present on every node, you effectively end up with x = y, and there is no further difference once floor(x/y) nodes are in use. That is exactly what you observed, and it only looks counter-intuitive at first. Would you really use a 100% replication factor in production?
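As a rough illustration of that arithmetic, here is a small sketch; the node count and replication factor come from the question, the even-distribution assumption and variable names are mine:

# Back-of-the-envelope locality estimate for the setup in the question.
# Assumes blocks are spread evenly across DataNodes, which real HDFS only approximates.
import math

nodes = 16         # worker nodes in the cluster (from the question)
replication = 3    # HDFS replication factor (from the question)

# Average fraction of the file each node stores locally: 3/16 ~= 0.19, i.e. about 20%.
local_fraction = replication / float(nodes)

# Roughly how many workers are needed so that, together, they can read the
# whole file from local disks without network transfer: ceil(16/3) = 6.
nodes_for_full_local_read = int(math.ceil(nodes / float(replication)))

print local_fraction, nodes_for_full_local_read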
Answer 1 (score: 1)
What parameters are specific to the Executor, the Driver and the RDD (regarding spilling and storage level)?

From the Spark documentation:

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.

To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.

Certain shuffle operations can consume significant amounts of heap memory, since they use in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
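One common way to reduce that memory pressure (a hedged suggestion on my part, not something the answer prescribes) is to give reduceByKey more output partitions, so each in-memory aggregation structure stays smaller and is less likely to spill. A minimal sketch, assuming the sc and input path from the question's program:

# Same word count as in the question, but with an explicit partition count on
# reduceByKey. The value 64 is an arbitrary example; tune it to the data size
# and the number of executor cores.
rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')
counts = (rdd.flatMap(lambda line: line.split(' '))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b, 64))  # 64 reduce-side partitions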
I would be interested in the memory/CPU core limits for the Map & Reduce tasks versus the Spark tasks.

Key parameters to benchmark on the Hadoop side:

yarn.nodemanager.resource.cpu-vcores
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.reduce.shuffle.memory.limit.percent

Key Spark parameters to benchmark for equivalence with Hadoop:

spark.driver.memory
spark.driver.cores
spark.executor.memory
spark.executor.cores
spark.memory.fraction
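For reference, a minimal sketch of how these Spark limits might be set when creating the context; the values are placeholders for illustration, not recommendations:

from pyspark import SparkConf, SparkContext

# Hypothetical values purely for illustration; choose them to mirror the
# YARN/MapReduce limits you are benchmarking against.
conf = (SparkConf()
        .setMaster('spark://192.168.2.11:7077')
        .setAppName('Benchmark with explicit limits')
        .set('spark.driver.memory', '2g')
        .set('spark.driver.cores', '1')
        .set('spark.executor.memory', '4g')
        .set('spark.executor.cores', '2')
        .set('spark.memory.fraction', '0.75'))

sc = SparkContext(conf=conf)

In practice the driver memory is usually passed on the spark-submit command line rather than in SparkConf, because the driver JVM is already running by the time the program sets it.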
These are just some of the key parameters. Have a look at the detailed settings for Spark and MapReduce. Without the right set of parameters in place, we cannot compare the performance of jobs across two different technologies.
Answer 2 (score: 0)
This comes down to how the data is laid out: a single plain-text file is not a great choice. There are better options, such as Parquet; if you use one, you will see a noticeable performance improvement, because the way the file is partitioned lets your Apache Spark cluster read its parts in parallel.
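A minimal sketch of that idea on Spark 1.5, assuming the DataFrame API is available; the output path and the single 'line' column are made up for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('spark://192.168.2.11:7077', 'Convert to Parquet')
sqlContext = SQLContext(sc)

# One-off conversion: read the plain text file, wrap each line in a one-column
# row, and write it back out as Parquet (a splittable, columnar format).
rows = sc.textFile('hdfs://192.168.2.11:9000/data/test2').map(lambda l: (l,))
df = sqlContext.createDataFrame(rows, ['line'])
df.write.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')

# Later jobs read the Parquet copy, and Spark scans its parts in parallel.
df2 = sqlContext.read.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')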