Loading a pandas dataframe into a Spark cluster

Date: 2017-06-08 00:33:56

Tags: postgresql apache-spark pyspark apache-spark-sql

I have a postgres database, and I want to run a query and load a table into a Spark dataframe. Some of the columns in my database are arrays. For example:

    => select id, f_2 from raw limit 1;

will return

    id       |  f_2
    ---------+-----------
    1        | {{140,130},{NULL,NULL},{NULL,NULL}}

What I want is to access the 140 (the first element of the inner array), which is easy in postgres with this query:

    => select id, f_2[1][1] from raw limit 1;
    id       |  f_2
    ---------+-----------
    1        | 140

But I want to load it into a Spark dataframe. This is the code I use to load the data:

    df = sqlContext.sql("""
    select id as id,
    f_2 as A
    from raw
    """)

which returns this error:

    Py4JJavaError: An error occurred while calling o560.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer

Then I tried this:

    df = sqlContext.sql("""
    select id as id,
    f_2[0] as A
    from raw
    """)

and got the same error. Then I tried this:

    df = sqlContext.sql("""
    select id as id,
    f_2[0][0] as A
    from raw
    """)

which returns this error:

    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line string', (1, 0))

    AnalysisException: u"Can't extract value from f_2#32685[0];"

0 Answers