In the Estimating Pi example at http://spark.apache.org/examples.html, I don't understand the difference between the Python/Scala examples and the Java example. Python and Scala both use map and reduce:

Python
from random import random  # needed for random(); omitted on the examples page

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = spark.parallelize(xrange(0, NUM_SAMPLES)).map(sample) \
             .reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
Scala
val count = spark.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
But the Java example uses filter:
int count = spark.parallelize(makeRange(1, NUM_SAMPLES)).filter(
    new Function<Integer, Boolean>() {
      public Boolean call(Integer i) {
        double x = Math.random();
        double y = Math.random();
        return x*x + y*y < 1;
      }
    }).count();
System.out.println("Pi is roughly " + 4 * count / NUM_SAMPLES);
Is this just a documentation mistake/bug? Is filter preferred in Java, and map/reduce preferred in Scala and Python, for some reason?
Answer 0 (score: 3)
These approaches are equivalent. The Java code simply counts the cases in which the Scala/Python map would return 1. Just to make it more transparent:
def inside(x, y):
    """Check if point (x, y) is inside a unit circle
    with center at the origin (0, 0)."""
    return x*x + y*y < 1

points = ...

# The Scala / Python code is equivalent to this
sum([1 if inside(x, y) else 0 for (x, y) in points])

# while the Java code is equivalent to this
len([(x, y) for (x, y) in points if inside(x, y)])
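
To check that the two formulations really count the same thing, here is a minimal self-contained sketch in plain Python, no Spark required; the seed, the sample size, and the names map_reduce_count and filter_count are my own arbitrary choices:

import random

random.seed(42)  # fixed seed so both passes see identical points
NUM_SAMPLES = 100000
points = [(random.random(), random.random()) for _ in range(NUM_SAMPLES)]

def inside(x, y):
    return x*x + y*y < 1

# map/reduce style: sum of 0/1 indicators (Python/Scala examples)
map_reduce_count = sum(1 if inside(x, y) else 0 for (x, y) in points)

# filter/count style: size of the filtered collection (Java example)
filter_count = len([(x, y) for (x, y) in points if inside(x, y)])

# Over the same points the two styles agree exactly, by construction.
assert map_reduce_count == filter_count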
In the end, count / NUM_SAMPLES approximates the fraction of the unit square covered by the quarter circle, which is π/4, so multiplying by 4 yields the estimate of π.
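
Plugging the count back into the same formula the examples use (continuing the sketch above, so filter_count and NUM_SAMPLES are the assumed names from it):

print("Pi is roughly %f" % (4.0 * filter_count / NUM_SAMPLES))
# The Monte Carlo error shrinks roughly like 1/sqrt(NUM_SAMPLES),
# so larger sample sizes give tighter estimates of pi.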