Configuration for loading data from PostgreSQL:
import pyspark
from pyspark.sql import DataFrameReader

# Configuration
conf = (pyspark.SparkConf()
        .setMaster("local")
        .setAppName("DBs")
        .set("spark.executor.memory", "8g")
        .set("spark.driver.memory", "16g"))

# Spark context
sc = pyspark.SparkContext(conf=conf)

# Reading data from a PostgreSQL table with 10M records
sqlContext = pyspark.SQLContext(sc)
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'user', 'password': 'pass'}
df = DataFrameReader(sqlContext).jdbc(url='jdbc:%s' % url, table='table_name', properties=properties)
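As an aside, jdbc() called this way reads all 10M rows through a single partition and a single database connection, so the entire result set lands on one JVM task. A partitioned read spreads the rows across several tasks. Below is a minimal sketch of that alternative; the partition column id and its bounds are assumptions, not from the actual table:

# Minimal sketch of a partitioned JDBC read. The column 'id' and its
# bounds are assumptions; use a real indexed numeric column of the table.
df_partitioned = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url,
    table='table_name',
    column='id',           # hypothetical numeric partition column
    lowerBound=1,          # assumed minimum of 'id'
    upperBound=10000000,   # assumed maximum of 'id'
    numPartitions=10,      # ~1M rows per partition
    properties=properties)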
When executing an action on the original, single-partition df:
df.head(4)
I get the following error:
Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:46577)
The full traceback of the error:
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/zaman/Downloads/Setups/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/home/zaman/Downloads/Setups/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:46577)
Traceback (most recent call last):
  File "/home/zaman/Downloads/Setups/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 963, in start
    self.socket.connect((self.address, self.port))
  File "/home/zaman/anaconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
System memory usage remains unchanged throughout the entire operation.
Note: when I try a similar operation on the same machine with the same configuration, it executes successfully with just 5,000 records.
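To pin down the row count at which the read starts failing, the table argument can also be a subquery that caps the result set. A sketch follows; the LIMIT value and the alias t are illustrative, not from the original setup:

# Sketch: cap the row count by passing a subquery as the table argument.
# The LIMIT value and the alias 't' are illustrative.
df_small = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url,
    table='(SELECT * FROM table_name LIMIT 100000) AS t',
    properties=properties)
df_small.head(4)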