I have a Spark cluster and HDFS running on the same machines. I copied a text file of about 3 GB onto the local filesystem of each machine and onto the HDFS distributed filesystem.
I have a simple word-count PySpark program.
If I submit the program reading the file from the local filesystem, it takes about 33 seconds. If I submit the program reading the file from HDFS, it takes about 46 seconds.
Why? I expected exactly the opposite result.

Added after sgvd's request:
16 slaves, 1 master
Spark Standalone with no particular settings (replication factor 3)
Version 1.5.2
import sys
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
import os
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['JAVA_HOME'] = '/usr/local/java'

from pyspark import SparkContext
# conf = pyspark.SparkConf().set<conf settings>

if sys.argv[1] == 'local':
    print 'Running in local file mode'
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    print 'Running in HDFS mode'
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')

# Word count: split on spaces, count each word, then take the five most frequent.
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print topFive
Answer 0 (score: 1)
This is somewhat counter-intuitive, but because the replication factor is 3 and you have 16 nodes, each node stores on average about 20% of the data locally in HDFS. About 6 worker nodes should therefore be enough, on average, to read the whole file without any network transfer.

If you record running time against the number of worker nodes, you should notice that beyond roughly 6 there is no difference between reading from the local FS and reading from HDFS.

The calculation above can be done with variables, say x = number of worker nodes and y = replication factor. Because reading from the local FS forces the file to be present on every node, you effectively end up with x = y, and there is no further difference once floor(x/y) nodes are in use. That is exactly what you observed, and it only looks counter-intuitive at first. Would you really use a 100% replication factor in production?
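As a rough illustration of that arithmetic, here is a small sketch; the node count and replication factor come from the question, the even-distribution assumption and variable names are mine:

# Back-of-the-envelope locality estimate for the setup in the question.
# Assumes blocks are spread evenly across DataNodes, which real HDFS only approximates.
import math

nodes = 16         # worker nodes in the cluster (from the question)
replication = 3    # HDFS replication factor (from the question)

# Average fraction of the file each node stores locally: 3/16 ~= 0.19, i.e. about 20%.
local_fraction = replication / float(nodes)

# Roughly how many workers are needed so that, together, they can read the
# whole file from local disks without network transfer: ceil(16/3) = 6.
nodes_for_full_local_read = int(math.ceil(nodes / float(replication)))

print local_fraction, nodes_for_full_local_read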
Answer 1 (score: 1)
What parameters are specific to the Executor, the Driver and the RDD (regarding spilling and storage level)?

From the Spark documentation:

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O.

To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark's map and reduce operations.

Certain shuffle operations can consume significant amounts of heap memory, since they use in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
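One common way to reduce that memory pressure (a hedged suggestion on my part, not something the answer prescribes) is to give reduceByKey more output partitions, so each in-memory aggregation structure stays smaller and is less likely to spill. A minimal sketch, assuming the sc and input path from the question's program:

# Same word count as in the question, but with an explicit partition count on
# reduceByKey. The value 64 is an arbitrary example; tune it to the data size
# and the number of executor cores.
rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')
counts = (rdd.flatMap(lambda line: line.split(' '))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b, 64))  # 64 reduce-side partitions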
I would be interested in the memory/CPU core limits for the Map & Reduce tasks versus the Spark tasks.

Key parameters to benchmark on the Hadoop side:

yarn.nodemanager.resource.cpu-vcores
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.reduce.shuffle.memory.limit.percent

Key Spark parameters to benchmark for equivalence with Hadoop:

spark.driver.memory
spark.driver.cores
spark.executor.memory
spark.executor.cores
spark.memory.fraction
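For reference, a minimal sketch of how these Spark limits might be set when creating the context; the values are placeholders for illustration, not recommendations:

from pyspark import SparkConf, SparkContext

# Hypothetical values purely for illustration; choose them to mirror the
# YARN/MapReduce limits you are benchmarking against.
conf = (SparkConf()
        .setMaster('spark://192.168.2.11:7077')
        .setAppName('Benchmark with explicit limits')
        .set('spark.driver.memory', '2g')
        .set('spark.driver.cores', '1')
        .set('spark.executor.memory', '4g')
        .set('spark.executor.cores', '2')
        .set('spark.memory.fraction', '0.75'))

sc = SparkContext(conf=conf)

In practice the driver memory is usually passed on the spark-submit command line rather than in SparkConf, because the driver JVM is already running by the time the program sets it.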
These are just some of the key parameters. Have a look at the detailed settings for Spark and MapReduce. Without the right set of parameters in place, we cannot compare the performance of jobs across two different technologies.
Answer 2 (score: 0)
This comes down to how the data is laid out: a single plain-text file is not a great choice. There are better options, such as Parquet; if you use one, you will see a noticeable performance improvement, because the way the file is partitioned lets your Apache Spark cluster read its parts in parallel.
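A minimal sketch of that idea on Spark 1.5, assuming the DataFrame API is available; the output path and the single 'line' column are made up for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('spark://192.168.2.11:7077', 'Convert to Parquet')
sqlContext = SQLContext(sc)

# One-off conversion: read the plain text file, wrap each line in a one-column
# row, and write it back out as Parquet (a splittable, columnar format).
rows = sc.textFile('hdfs://192.168.2.11:9000/data/test2').map(lambda l: (l,))
df = sqlContext.createDataFrame(rows, ['line'])
df.write.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')

# Later jobs read the Parquet copy, and Spark scans its parts in parallel.
df2 = sqlContext.read.parquet('hdfs://192.168.2.11:9000/data/test2_parquet')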