Cannot resolve column name in a DataFrame - pyspark 1.6

Time: 2017-05-03 18:02:17

Tags: python apache-spark pyspark

We have a few DataFrames created from Hive tables. When two of these DataFrames are joined, pyspark throws pyspark.sql.utils.AnalysisException: u'Cannot resolve column name'. However, printSchema() prints the columns correctly. I cannot share the whole code, but here are the DataFrames used for the join:

>>> scoreitemavail.printSchema()
root
 |-- key: string (nullable = true)
 |-- item_nbr: string (nullable = true)
 |-- cat_nbr: string (nullable = true)
 |-- score: double (nullable = true)

>>> itemagg = scoreitemavail.groupBy('item_nbr', 'cat_nbr').agg({"score": "count"})
>>> itemagg.printSchema()
root
 |-- item_nbr: string (nullable = true)
 |-- cat_nbr: string (nullable = true)
 |-- count(score): long (nullable = false)
>>> subcatagg = scoreitemavail.groupBy('cat_nbr').agg(F.countDistinct('key').alias('count_cat'))
>>> subcatagg.printSchema()
root
 |-- cat_nbr: string (nullable = true)
 |-- count_cat: long (nullable = false)

>>> itemsubcatjoin1 = subcatagg.join(itemagg, itemagg['cat_nbr'] == subcatagg['cat_nbr'], 'inner')

17/05/03 17:55:47 WARN Column: Constructing trivially true equals predicate, 'cat_nbr#43 = cat_nbr#43'. Perhaps you need to use aliases.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/dataframe.py", line 653, in join
    jdf = self._jdf.join(other._jdf, on._jc, how)
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name "cat_nbr" among (_col7, _col6, count(score));'

Any suggestions would be helpful. Thanks!
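
For reference, the warning's hint about aliases can be applied roughly as follows. This is a minimal, untested sketch assuming the same source DataFrame; the DataFrame aliases 'i' and 's' and the column name 'score_count' are illustrative and not from the original session:

    import pyspark.sql.functions as F

    # Give each aggregate an explicit name and alias both DataFrames
    # before joining, so the join condition can qualify each cat_nbr.
    itemagg = (scoreitemavail
               .groupBy('item_nbr', 'cat_nbr')
               .agg(F.count('score').alias('score_count'))
               .alias('i'))

    subcatagg = (scoreitemavail
                 .groupBy('cat_nbr')
                 .agg(F.countDistinct('key').alias('count_cat'))
                 .alias('s'))

    # Qualified column references distinguish the two cat_nbr columns.
    itemsubcatjoin1 = subcatagg.join(
        itemagg,
        F.col('s.cat_nbr') == F.col('i.cat_nbr'),
        'inner')

Replacing the dict form agg({"score": "count"}) with F.count('score').alias(...) also gives the aggregate a stable column name, avoiding references to the auto-generated count(score) column later.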

0 Answers:

No answers