Loading a pandas dataframe into a Spark cluster

Date: 2017-06-08 00:33:56

Tags: postgresql apache-spark pyspark apache-spark-sql

I have a postgres database, and I want to run a query and load a table into a Spark dataframe. Some of the columns in my database are arrays. For example:

    => select id, f_2 from raw limit 1;

will return

    id       |  f_2
    ---------+-----------
    1        | {{140,130},{NULL,NULL},{NULL,NULL}}

What I want is to access the 140 (the first element of the inner array), which is easy in postgres with this query:

    => select id, f_2[1][1] from raw limit 1;
    id       |  f_2
    ---------+-----------
    1        | 140

But I want to load it into a Spark dataframe. This is the code I use to load the data:

    df = sqlContext.sql("""
    select id as id,
    f_2 as A
    from raw
    """)

which returns this error:

    Py4JJavaError: An error occurred while calling o560.collectToPython.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 4, localhost, executor driver): java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to java.lang.Integer

Then I tried this:

    df = sqlContext.sql("""
    select id as id,
    f_2[0] as A
    from raw
    """)

and got the same error. Then I tried this:

    df = sqlContext.sql("""
    select id as id,
    f_2[0][0] as A
    from raw
    """)

which returns this error:

    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line string', (1, 0))

    AnalysisException: u"Can't extract value from f_2#32685[0];"

0 Answers