I'm new to PySpark. After finally getting Spark set up and being able to launch spark-shell and pyspark from the command line, I tried running the following code to compare how my system performs with and without Spark:
import time
import pyspark
import random
import os

# Point PySpark at the local Spark installation if it isn't already configured
if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = 'C:/spark'

def inside(p):
    # Sample a random point in the unit square; True if it lands in the quarter circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

num_samples = 100000000

# Monte Carlo estimate of pi with Spark
t1 = time.time()
with pyspark.SparkContext("local[7]", appName="Pi") as sc:
    count = sc.parallelize(range(0, num_samples)).filter(inside).count()
    sc.stop()
pi = 4 * count / num_samples
print(pi)
print('total time: {}'.format(time.time()-t1))

# Same estimate in plain Python, for comparison
t2 = time.time()
count = [inside(p) for p in range(num_samples)]
pi = 4 * sum(count) / num_samples
print(pi)
print('total time without spark: {}'.format(time.time()-t2))
It runs successfully and produces this output:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/10 14:15:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/10 14:15:35 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
3.14152352
total time: 13.785562992095947
3.14170472
total time without spark: 30.516133069992065
SUCCESS: The process with PID 11448 (child process of PID 5992) has been terminated.
SUCCESS: The process with PID 5992 (child process of PID 6640) has been terminated.
SUCCESS: The process with PID 6640 (child process of PID 17908) has been terminated.
but at the very end it also produces the following error:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:61315)
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\py4j\java_gateway.py", line 1021, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
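
My guess (unconfirmed) is that the explicit sc.stop() inside the with block is redundant, since the context manager already stops the SparkContext on exit, so the second shutdown may leave py4j trying to talk to a JVM gateway that is already gone. This is the minimal variant I was considering, assuming that is indeed the cause:

import time
import pyspark
import random

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

num_samples = 100000000
t1 = time.time()
# Let the context manager shut the SparkContext down; no explicit sc.stop()
with pyspark.SparkContext("local[7]", appName="Pi") as sc:
    count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
print('total time: {}'.format(time.time()-t1))

Is the redundant sc.stop() really what triggers the ConnectionResetError, or is something else wrong with my setup? Can I safely ignore the error, given that the results themselves look correct?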