I am trying to do a simple select through a table alias using SQLContext.sql in Spark 1.6.
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType

sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
          .map(lambda line: [x for x in line.split(",")]))
## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])
## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))
headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.desc from headerTab as d").show()
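I noticed that this seems to work in Spark 2.0, but I am currently restricted to 1.6.
Below is the error message I am seeing. For a simple select I could just drop the alias, but ultimately I am trying to join multiple tables that have the same column names.
Spark 1.6 error: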
Traceback (most recent call last):
File "/home/temp/text_import.py", line 49, in <module>
head = sqlCtx.sql("select d.desc from headerTab as d").show()
File "/home/pricing/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/home/pricing/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/home/pricing/spark-1.6.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/home/pricing/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o64.sql.
: java.lang.RuntimeException: [1.10] failure: ``*'' expected but `desc' found
Spark 2.0 returns:
+--------+
|    desc|
+--------+
|    data|
|    data|
|    data|
Answer 0 (score: 3)
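The root cause is that desc is a SQL keyword (as in ORDER BY ... DESC) and you are using it as a column name. The Spark 1.6 parser error points at position [1.10], which appears to be line 1, column 10: exactly where desc begins in the query, right after "select d.". You can fix this in two ways: rename the column, or wrap the keyword desc in backticks (`).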
Method 1: rename the column
sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
.map(lambda line: [x for x in line.split(",")]))
## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'description'])
## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))
headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.description from headerTab as d").show()
Method 2: quote the keyword with backticks
sqlCtx = SQLContext(sc)
## Import CSV File
header = (sc.textFile("data.csv")
.map(lambda line: [x for x in line.split(",")]))
## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'desc'])
## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))
headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.`desc` from headerTab as d").show()
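For the join case mentioned in the question, where several tables share column names, the same backtick quoting can be applied to every qualified column. The following is a minimal sketch in plain Python that only builds the SQL string. The helper names quote_ident and qualified, the second table name otherTab, and the output aliases desc_a and desc_b are hypothetical and do not come from the original post.

```python
def quote_ident(name):
    # Wrap a column name in backticks so a keyword such as desc is
    # read as an identifier. Names containing a backtick are rejected
    # because escaping rules are not covered in this sketch.
    if "`" in name:
        raise ValueError("backtick in identifier: %r" % name)
    return "`" + name + "`"


def qualified(alias, column):
    # Build alias.`column`, the same shape as d.`desc` above.
    return alias + "." + quote_ident(column)


# Hypothetical join of headerTab with a second table, otherTab, that
# is assumed to have header and desc columns as well.
query = (
    "select {a_desc} as desc_a, {b_desc} as desc_b "
    "from headerTab as a join otherTab as b "
    "on {a_key} = {b_key}"
).format(
    a_desc=qualified("a", "desc"),
    b_desc=qualified("b", "desc"),
    a_key=qualified("a", "header"),
    b_key=qualified("b", "header"),
)

print(query)
```

Once both temporary tables are registered, the string could be passed to sqlCtx.sql(query). This sketch has not been run against a Spark 1.6 cluster, so treat the join syntax as an assumption to verify.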
Answer 1 (score: 1)
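As noted in the comments below the question, using desc as a column name is not appropriate because it is a keyword. Renaming the column solves the problem.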
## Convert RDD to DF, specify column names
headerDF = header.toDF(['header', 'adj', 'descTmp'])
## Convert Adj Column to numeric
headerDF = headerDF.withColumn("adj", headerDF['adj'].cast(DoubleType()))
headerDF.registerTempTable("headerTab")
head = sqlCtx.sql("select d.descTmp from headerTab as d").show()
+-----------+
|    descTmp|
+-----------+
|       data|
|       data|
|       data|
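When a file has many columns, the rename can be applied programmatically before calling toDF. This is a minimal sketch, assuming a small hand-picked keyword set: RESERVED below is illustrative only and is not Spark's actual keyword list, and the helper safe_column_names is hypothetical.

```python
# Illustrative subset of SQL keywords; NOT the full list used by Spark.
RESERVED = {"desc", "asc", "select", "from", "order", "group", "table"}


def safe_column_names(names, suffix="Tmp"):
    # Append a suffix to any name that matches a keyword
    # (case-insensitive), leaving all other names unchanged.
    return [n + suffix if n.lower() in RESERVED else n for n in names]


columns = safe_column_names(["header", "adj", "desc"])
print(columns)  # ['header', 'adj', 'descTmp']
```

The resulting list could be passed to header.toDF(columns) in place of the hard-coded column names.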