Unable to perform an inner join with pyspark.sql

Posted: 2018-08-01 23:29:52

Tags: pyspark-sql

Please let me know if this is the wrong forum for the following question.

I have created the following pyspark.sql query:

#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ops').getOrCreate()

# Read each CSV into its own DataFrame
df_person = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv', inferSchema=True, header=True)
df_password = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Password.csv', inferSchema=True, header=True)

# Register each DataFrame as a temporary view so it can be queried with SQL
df_person.createOrReplaceTempView('Person_Person')
df_password.createOrReplaceTempView('Person_Password')

myresults = spark.sql("""SELECT
FirstName,
LastName,
PasswordHash
FROM Person_Person
INNER JOIN Person_Password
ON BusinessEntityID = BusinessEntityID""")
myresults.show()

The spark.sql query attempts a simple inner join. However, it keeps failing with the following error:

AnalysisException                         Traceback (most recent call last)
<ipython-input-51-0f640112ef53> in <module>()
     14 FROM Person_Person
     15 INNER JOIN Person_Password
---> 16 ON BusinessEntityID = BusinessEntityID""")
     17 myresults.show()

~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py in sql(self, sqlQuery)
    539         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    540         """
--> 541         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    542 
    543     @since(2.0)

~/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "Reference 'BusinessEntityID' is ambiguous, could be: BusinessEntityID#639, BusinessEntityID#657.; line 7 pos 3"

Can someone let me know where I am going wrong?

Thanks

1 Answer:

Answer 0 (score: 0):

As your code is written now, Spark doesn't know which table BusinessEntityID comes from. You have to qualify each column with its table name, like this:

ON Person_Person.BusinessEntityID = Person_Password.BusinessEntityID
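
For completeness, here is a minimal sketch of the full corrected query using table aliases (the aliases p and pw are illustrative; the column names assume the AdventureWorks-style schemas used in the question):

myresults = spark.sql("""
    SELECT p.FirstName,
           p.LastName,
           pw.PasswordHash
    FROM Person_Person AS p
    INNER JOIN Person_Password AS pw
        ON p.BusinessEntityID = pw.BusinessEntityID
""")
myresults.show()

Equivalently, assuming the two DataFrames are still bound to df_person and df_password as in the question, the DataFrame API can join on the column name directly; passing the name as a string also keeps a single BusinessEntityID column in the result:

# Equi-join on the shared column name; 'inner' is the default join type
joined = df_person.join(df_password, 'BusinessEntityID', 'inner')
joined.select('FirstName', 'LastName', 'PasswordHash').show()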