How do I access a global temp view from another PySpark application?

Asked: 2018-12-18 06:47:31

Tags: apache-spark pyspark apache-spark-sql

I have a spark shell which invokes a pyscript and creates a global temp view.

Here is what I am doing in my first spark shell script:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark SQL Parallel load example") \
    .config("spark.jars", "/u/user/graghav6/sqljdbc4.jar") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.sql.shuffle.partitions", "50") \
    .config("hive.metastore.uris", "thrift://xxxxx:9083") \
    .config("spark.sql.join.preferSortMergeJoin", "true") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .enableHiveSupport() \
    .getOrCreate()

# after doing some transformations I try to create a global temp view of the dataframe:

df1.createGlobalTempView("df1_global_view")
spark.stop()
exit()

Here is my second spark shell script:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Spark SQL Parallel load example") \
    .config("spark.jars", "/u/user/graghav6/sqljdbc4.jar") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.shuffle.service.enabled", "true") \
    .config("hive.exec.dynamic.partition", "true") \
    .config("hive.exec.dynamic.partition.mode", "nonstrict") \
    .config("spark.sql.shuffle.partitions", "50") \
    .config("hive.metastore.uris", "thrift://xxxx:9083") \
    .config("spark.sql.join.preferSortMergeJoin", "true") \
    .config("spark.sql.autoBroadcastJoinThreshold", "-1") \
    .enableHiveSupport() \
    .getOrCreate()

newSparkSession = spark.newSession()
# reading data from the global temp view
data_df_save = newSparkSession.sql(""" select * from global_temp.df1_global_view""")
data_df_save.show()

newSparkSession.stop()
exit()

I am getting the following error:

Stdoutput pyspark.sql.utils.AnalysisException: u"Table or view not found: `global_temp`.`df1_global_view`; line 1 pos 15;\n'Project [*]\n+- 'UnresolvedRelation `global_temp`.`df1_global_view`\n"

It looks like I am missing something. How can I share the same global temp view across multiple sessions? Am I closing the Spark session incorrectly in the first spark shell? I have found a couple of answers on Stack Overflow already, but was not able to figure out the cause.

1 answer:

Answer 0 (score: 1)

You are using createGlobalTempView, so it is still a temporary view: it will no longer be available once you stop the application.

In other words, it will be available in another SparkSession within the same application, but not in another PySpark application.
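
To make the distinction concrete, here is a minimal sketch of both cases. The view name, sample data, and table name below are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("global-temp-view-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df.createGlobalTempView("demo_view")

# Works: a NEW session inside the SAME application can see the view,
# because global temp views live in the shared global_temp database
# tied to this application's SparkContext.
other_session = spark.newSession()
other_session.sql("SELECT * FROM global_temp.demo_view").show()

# Once the application stops, the global_temp database is gone; a
# different PySpark application would get "Table or view not found".
spark.stop()

# To share data across applications, persist it instead, e.g. as a
# Hive table (hypothetical table name; requires enableHiveSupport and
# a shared metastore, which the question's config already points to):
#   df.write.mode("overwrite").saveAsTable("shared_db.demo_table")

Since both of the question's scripts configure hive.metastore.uris, writing the dataframe out as a persistent table in the first application and reading it back with spark.table in the second is the usual workaround.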