create_spark.py

The code looks like this:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

# app_name is assumed to be defined earlier in the script
spark_conf = (SparkConf().setAppName(app_name)
              .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
              .set("spark.task.maxFailures", "14")
              .set("spark.port.maxRetries", "50")
              .set("spark.yarn.max.executor.failures", "14"))

spark_context = SparkContext(conf=spark_conf)
sqlContext = HiveContext(spark_context)

Then there is another file that contains all the code, named function_file.py.

It has the following function, which only performs some operations on the data:

def adjust_name(line):
    # Return the part of the name before the first "(" (or before an escaped "\(").
    if line is None:
        return line
    if "\\(" in line:
        return line.split("\\(")[0]
    if "(" in line:
        return line.split("(")[0]
    return line

Now we create a UDF from the adjust_name function:

adjust = udf(adjust_name, StringType())

There is also another function, process_sql, which does all the table loading and so on, and uses this UDF. For example:

def process_sql(sqlContext, source_db, processing_db, table_name):
    # ... (table loading and other steps omitted in the original post)
    df3 = df3.withColumn('org_name', trim(adjust(df3['col_name'])))
    return table_name

Now, in create_spark.py, I import function_file as a module and call the process_sql function like this:

x = function_file.process_sql(sqlContext, source_db, processing_db, table_name)

All the parameters are defined beforehand. But I get an error like this:

ValueError: Cannot run multiple SparkContexts at once; existing SparkContext() created by udf at function_file.py

Note: I can only use Spark 1.6.

Edit: I have a clue. The UDF is creating a SparkContext even before create_spark.py creates its own.

Connecting to Spark and creating context with dim_emp_atsc_test_4_sept
spark_context = SparkContext(conf=spark_conf)
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=process_handler.py, master=yarn-client) created by udf at ..

Answer 1

I think the problem is with HiveContext and SparkContext.

Try using only one of them, or pass the SparkContext as a constructor argument when creating the HiveContext.
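A minimal sketch of that suggestion, assuming PySpark 1.4 or later, where SparkContext.getOrCreate is available. create_contexts is a hypothetical helper name. It uses getOrCreate instead of the SparkContext(...) constructor, so an already-running context is reused rather than raising the ValueError, and the resulting context is passed to HiveContext.

```python
def create_contexts(app_name):
    """Return (spark_context, sql_context), reusing a running SparkContext if any."""
    # Imported inside the function so that importing this module does not itself need Spark.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import HiveContext

    spark_conf = (SparkConf().setAppName(app_name)
                  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                  .set("spark.task.maxFailures", "14"))

    # Returns the active context if one exists; otherwise creates one from spark_conf.
    spark_context = SparkContext.getOrCreate(conf=spark_conf)

    # Build the HiveContext on top of that single SparkContext.
    sql_context = HiveContext(spark_context)
    return spark_context, sql_context
```

One caveat: if a context already exists when this runs, getOrCreate returns it unchanged, so the settings in spark_conf (including the app name) are not applied to it.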
