Spark - how to query a registered Spark temp table

Asked: 2016-12-09 00:04:46

Tags: apache-spark pyspark spark-dataframe

I am trying to run the piece of Spark code below under pyspark and I am getting an error. Can you help me understand what is missing?

p1 = pd.DataFrame(final_data, columns=['Year', 'Name', 'Sex', 'Count'])
h1 = sqlContext.createDataFrame(p1)
h1.registerTempTable('namesdb')
sqlContext.sql("select SUBSTR(Name, 1, 1) as char1, count(Name) FROM namesdb group by char1 order by char1 ASC").toPandas()

But I receive the following error:

AnalysisException: u"cannot resolve 'char1' given input columns: [Year, Name, Sex, Count];

Here are sample records from final_data:

final_data[:2]

[[1880, 'Mary', 'F', '7065'],
 [1880, 'Anna', 'F', '2604']]
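For reference, here is a minimal stand-alone sketch (plain pandas, no Spark) of what the failing query is meant to compute: the count of names grouped by first letter. The column name `name_count` is my own label, not from the original query.

```python
import pandas as pd

# Sample records from the question.
final_data = [[1880, 'Mary', 'F', '7065'],
              [1880, 'Anna', 'F', '2604']]
p1 = pd.DataFrame(final_data, columns=['Year', 'Name', 'Sex', 'Count'])

# Pandas equivalent of: SELECT SUBSTR(Name,1,1) AS char1, COUNT(Name)
#                       ... GROUP BY char1 ORDER BY char1 ASC
result = (p1.assign(char1=p1['Name'].str[0])   # first letter of each name
            .groupby('char1')['Name']
            .count()
            .reset_index(name='name_count')
            .sort_values('char1'))
print(result)
```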

2 answers:

Answer 0 (score: 0)

Your query should look like the following: moving the SUBSTR into a subquery makes the char1 alias visible to the outer GROUP BY. Here (a link in the original answer) is a detailed explanation of how to use an alias in a SQL GROUP BY.

df1 = sqlContext.sql("select char1, count(Name) from (select *, SUBSTR(Name, 1, 1) as char1 FROM namesdb) t group by char1 order by char1 ASC")
df1.show()
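The subquery approach can be tried without a Spark cluster; this sketch uses sqlite3 in place of Spark SQL (the syntax for this particular query is the same), with an in-memory table mirroring namesdb. The subquery alias `t` is my addition.

```python
import sqlite3

# Build a throwaway in-memory table shaped like the question's namesdb.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE namesdb (Year INTEGER, Name TEXT, Sex TEXT, Count TEXT)")
conn.executemany("INSERT INTO namesdb VALUES (?, ?, ?, ?)",
                 [(1880, 'Mary', 'F', '7065'), (1880, 'Anna', 'F', '2604')])

# Inner SELECT adds char1; the outer GROUP BY can then reference it by name.
rows = conn.execute(
    "SELECT char1, COUNT(Name) "
    "FROM (SELECT *, SUBSTR(Name, 1, 1) AS char1 FROM namesdb) t "
    "GROUP BY char1 ORDER BY char1 ASC"
).fetchall()
print(rows)
```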

Answer 1 (score: 0)

In SQL, you cannot use the assigned column alias 'char1' in the GROUP BY clause; you can instead repeat the expression in the GROUP BY clause:

select SUBSTR(Name, 1, 1) as char1, count(Name) FROM namesdb group by SUBSTR(NAME,1,1) order by char1 ASC
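This variant can also be checked without Spark; the sketch below runs the answer's query (expression repeated in GROUP BY) against an in-memory sqlite3 table standing in for namesdb.

```python
import sqlite3

# In-memory stand-in for the question's namesdb table.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE namesdb (Year INTEGER, Name TEXT, Sex TEXT, Count TEXT)")
conn.executemany("INSERT INTO namesdb VALUES (?, ?, ?, ?)",
                 [(1880, 'Mary', 'F', '7065'), (1880, 'Anna', 'F', '2604')])

# GROUP BY repeats SUBSTR(Name, 1, 1) instead of referencing the alias char1.
rows = conn.execute(
    "SELECT SUBSTR(Name, 1, 1) AS char1, COUNT(Name) "
    "FROM namesdb GROUP BY SUBSTR(Name, 1, 1) ORDER BY char1 ASC"
).fetchall()
print(rows)
```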