pyspark.sql.dataframe.DataFrame crashes when applying an operation to 10 million records

Asked: 2017-09-25 07:50:22

Tags: pyspark spark-dataframe

Configuration for loading the data from PostgreSQL:

import pyspark

# Configuration
conf = (pyspark.SparkConf()
     .setMaster("local")
     .setAppName("DBs")
     .set("spark.executor.memory", "8g")   # no separate executors in local mode
     .set("spark.driver.memory", "16g"))

# Spark context
sc = pyspark.SparkContext(conf=conf)

# Reading data from a PostgreSQL table with 10M records
sqlContext = pyspark.SQLContext(sc)
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'user', 'password': 'pass'}
df = sqlContext.read.jdbc(url='jdbc:%s' % url, table='table_name', properties=properties)
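
Since the table holds roughly 10M rows, the read could also be partitioned so that all rows are not pulled through a single JDBC connection. A minimal sketch, where the column name id, the bounds, and the partition count are assumptions to be adjusted to the real table:

# Sketch: partitioned JDBC read; "id", the bounds, and the partition
# count are placeholders for the real table's values.
df = sqlContext.read.jdbc(
    url='jdbc:%s' % url,
    table='table_name',
    column='id',            # hypothetical indexed integer column
    lowerBound=1,
    upperBound=10000000,    # ~10M records per the question
    numPartitions=10,
    properties=properties)

Spark then issues numPartitions range queries over the id column instead of a single full scan.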

When executing the operation:

df.head(4)

I get the following error:

Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:46577)

The traceback of the error is as follows:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/zaman/Downloads/Setups/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/home/zaman/Downloads/Setups/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:46577)
Traceback (most recent call last):
  File "/home/zaman/Downloads/Setups/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 963, in start
    self.socket.connect((self.address, self.port))
  File "/home/zaman/anaconda2/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

The system's memory stays unchanged throughout the operation, as shown below:

Memory remains at 5.4 GB throughout the process.

Note: when I try a similar operation on the same machine with the same configuration, it executes successfully with just 5,000 records.
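
For reference, with .setMaster("local") the JDBC fetch and the head() call all run inside the single driver JVM, so a "Connection refused" from Py4J typically means that JVM has died rather than a network problem. A minimal sketch of pinning the driver heap at JVM launch via the PYSPARK_SUBMIT_ARGS environment variable (an alternative to SparkConf; it must be set before the SparkContext is created):

import os

# Must run before pyspark.SparkContext(conf=conf) is called; the driver
# JVM is launched at context creation with whatever heap it was given.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 16g pyspark-shell'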

0 Answers:

There are no answers.