Please let me know if I'm asking this in the wrong forum.
I created the following pyspark.sql query:
#%%
import findspark
findspark.init('/home/packt/spark-2.1.0-bin-hadoop2.7')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('ops').getOrCreate()
# Read each table into its own DataFrame, one per temp view
df_person = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv',inferSchema=True,header=True)
df_password = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Password.csv',inferSchema=True,header=True)
df_person.createOrReplaceTempView('Person_Person')
df_password.createOrReplaceTempView('Person_Password')
myresults = spark.sql("""SELECT
FirstName,
LastName,
PasswordHash
FROM Person_Person
INNER JOIN Person_Password
ON BusinessEntityID = BusinessEntityID""")
myresults.show()
The spark.sql query attempts a simple inner join, but it keeps failing with the following error:
AnalysisException Traceback (most recent call last)
<ipython-input-51-0f640112ef53> in <module>()
14 FROM Person_Person
15 INNER JOIN Person_Password
---> 16 ON BusinessEntityID = BusinessEntityID""")
17 myresults.show()
~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/session.py in sql(self, sqlQuery)
539 [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
540 """
--> 541 return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
542
543 @since(2.0)
~/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
67 e.java_exception.getStackTrace()))
68 if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
71 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
AnalysisException: "Reference 'BusinessEntityID' is ambiguous, could be: BusinessEntityID#639, BusinessEntityID#657.; line 7 pos 3"
Can someone tell me where I'm going wrong?
Thanks
Answer 0 (score: 0):
As your code is written now, Spark doesn't know which table BusinessEntityID comes from. You have to qualify each column with its table name, like this:
ON Person_Person.BusinessEntityID = Person_Password.BusinessEntityID
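
Putting it together, here is a minimal sketch of the corrected query, reusing the session, paths, and column names from the question (assuming both CSVs actually contain a BusinessEntityID column). Table aliases keep the qualified join condition short:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ops').getOrCreate()

# Register each CSV under its own temp view
df_person = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Person.csv', inferSchema=True, header=True)
df_password = spark.read.csv('/home/packt/Downloads/Spark_DataFrames/Person_Password.csv', inferSchema=True, header=True)
df_person.createOrReplaceTempView('Person_Person')
df_password.createOrReplaceTempView('Person_Password')

# Qualify the join keys (here via the aliases p and pw) so neither
# reference to BusinessEntityID is ambiguous
myresults = spark.sql("""
    SELECT p.FirstName,
           p.LastName,
           pw.PasswordHash
    FROM Person_Person AS p
    INNER JOIN Person_Password AS pw
        ON p.BusinessEntityID = pw.BusinessEntityID
""")
myresults.show()

Alternatively, Spark SQL accepts JOIN ... USING (BusinessEntityID), which resolves the same ambiguity and keeps a single copy of the key column; the DataFrame API equivalent, df_person.join(df_password, 'BusinessEntityID'), behaves the same way.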