Question

Here is the full description of the problem. I need to resolve it as soon as possible.

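Background: I am trying to run a simple Python Flask application that creates a new Spark session each time it reads a CSV file. The goal is to test how many Spark sessions can run in parallel within a single Spark context.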

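In the Flask code I set up a REST API endpoint, "deploy", to which I pass a CSV file. For each request, the Flask code:
1. Creates a new Spark session for the CSV file
2. Reads the CSV file
3. Reports the number of records in the CSV file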

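Configuration:
- I am using HDP 3.1.0 on a VM, with Spark version 2.3.3.
- A Python virtual environment is set up, with HADOOP_CONF_DIR, YARN_CONF_DIR, SPARK_MAJOR_VERSION and SPARK_HOME set correctly.
- The necessary packages, such as Flask, are installed in the virtual environment.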

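Observation:
- With a sample XML file set up for fair scheduling, and the Spark configuration parameters spark.scheduler.mode and spark.scheduler.allocation.file set to "FAIR" and the XML file name respectively, one Spark session runs in FAIR mode and all the remaining sessions run in FIFO mode.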
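
As a rough illustration of the two settings named in the observation, the sketch below generates a minimal Spark pool allocation file and the matching configuration pairs. The pool name fair_pool and the file name test.xml come from the code later in this post; the helper names build_pool_allocations and fair_scheduler_conf, and the weight and minShare values, are assumptions made for illustration only. It is written for Python 3.

```python
import xml.etree.ElementTree as ET


def build_pool_allocations(pools):
    """Return the text of a Spark fair scheduler allocation file.

    `pools` maps a pool name to a (schedulingMode, weight, minShare) tuple.
    """
    root = ET.Element("allocations")
    for name, (mode, weight, min_share) in sorted(pools.items()):
        pool = ET.SubElement(root, "pool", {"name": name})
        ET.SubElement(pool, "schedulingMode").text = mode
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


def fair_scheduler_conf(allocation_path):
    """Key/value pairs intended to be set on SparkConf before the context starts."""
    return {
        "spark.scheduler.mode": "FAIR",
        "spark.scheduler.allocation.file": allocation_path,
    }


# Hypothetical values: one FAIR pool with weight 1 and no minimum share.
allocation_xml = build_pool_allocations({"fair_pool": ("FAIR", 1, 0)})
conf_pairs = fair_scheduler_conf("test.xml")
```

Each pair in conf_pairs could then be applied with sc_conf.set(key, value) before the SparkContext is constructed. Jobs that are not assigned to a named pool go to Spark's default pool, which may be worth ruling out when only some sessions appear to run in FAIR mode.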

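Because of this, I cannot accurately assess how many Spark sessions each Spark context can spawn.
To test the same Flask application in cluster mode, I made the following additional configuration changes:
1. Created a new YARN queue for Spark in Ambari and divided resources equally between the Default queue and the new queue.
2. Modified "yarn-site.xml" to change the yarn.resourcemanager.scheduler.class parameter to FairScheduler, and created a new XML file, "fair-scheduler.xml", configuring the minimum and maximum resources of the new YARN queue.
In the Flask Python code I set the following parameter: spark.yarn.queue, set to the newly created queue.
Besides the above, I also set a few other configuration parameters related to fair scheduling.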

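Problem: Although the code compiles without errors, the job does not run at all. When I pass a CSV file to the Flask code, the individual Spark stages would normally be displayed.

However, nothing is executed and the code runs forever.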

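My questions:
1. Can anyone help me with the complete setup and configuration of:
a. A separate YARN queue that supports fair scheduling
b. The related changes to make in yarn-site.xml
c. The absolutely necessary contents of fair-scheduler.xml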

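I am not running "spark-submit"; instead I run the "python" file, and the necessary spark-submit command is generated internally. The queue name is passed in the code itself. I need to know all the Spark configuration parameters that must be passed to ensure that the job itself runs in fair scheduling mode.
Finally, why is the job not running? Am I missing any configuration?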
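
One setting that is easy to overlook is spark.scheduler.pool, which Spark treats as a local property of the thread that submits a job rather than as a global option. The sketch below shows one way a request handler might assign its own jobs to a pool. The helper name run_in_pool is hypothetical, and whether the property reliably follows a Python thread on Spark 2.3 is an assumption that should be verified.

```python
def run_in_pool(sc, pool_name, job):
    """Run job() with the calling thread's Spark jobs assigned to pool_name.

    `sc` is expected to be a SparkContext (or any object exposing
    setLocalProperty); `job` is a zero-argument callable.
    """
    sc.setLocalProperty("spark.scheduler.pool", pool_name)
    try:
        return job()
    finally:
        # Passing None clears the pool association for this thread.
        sc.setLocalProperty("spark.scheduler.pool", None)
```

Inside the register() handler this could be used as count = run_in_pool(sc, "fair_pool", df.count), so that the property is set in the thread that serves the request instead of once at start-up.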

Your input would be a great help.

Code:

from flask import Flask
from flask import request
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark import SparkContext
from pyspark.sql.functions import udf
import os
import time

#Flask constructor. This takes the current name of the module as argument
app = Flask(__name__)

#The atexit module provides a simple interface to register functions to be called when
#a program closes down normally.
import atexit


def exit_handler():
    spark.stop()

#The commented configs all ran fine for standalone mode 

sparkSessionDict = {}
#spark = SparkSession.builder.appName("Sudhindra").getOrCreate()
sc_conf = SparkConf()
sc_conf.setAppName("Sudhindra")
sc_conf.setMaster("yarn")
sc_conf.set("spark.yarn.queue","sparkQueue")
#sc_conf.setMaster("local[*]")
#sc_conf.set("spark.scheduler.mode","FAIR")
#sc_conf.set("spark.scheduler.allocation.file","test.xml")
sc_conf.set("spark.driver.memory","2g")
#sc_conf.set("spark.executor.memory","4g")
#sc_conf.set("spark.executor.cores",5)
#sc_conf.set("spark.num.executors",2)
sc_conf.set("spark.eventLog.enabled","true")

print(sc_conf.getAll())

sc = SparkContext(conf=sc_conf)
driver_memory = sc._conf.get('spark.driver.memory')
exec_memory = sc._conf.get('spark.executor.memory')
cores = sc._conf.get('spark.executor.cores')
executors = sc._conf.get('spark.num.executors')
print("Driver Memory = ",driver_memory)
print("Executor Memory = ",exec_memory)
print("Cores = ",cores)
print("Executors = ",executors)

sc.setLocalProperty("spark.scheduler.pool", "fair_pool")
#spark = SparkSession().builder.config(conf=sc.getConf).getOrCreate()
spark=SparkSession(sc)

#spark = SparkSession.builder.appName("Sudhindra").master("local[*]").config("spark.scheduler.pool","fair_pool").config("spark.driver.memory","2g").config("spark.executor.memory","10g").config("spark.executor.cores",5).config("spark.num.executors",2).config("spark.eventLog.enabled","true").config("spark.scheduler.allocation.file","test.xml").config("spark.scheduler.mode","FAIR").getOrCreate()

@app.route("/deploy", methods=["GET"])
def register():
    start = time.time()
    read_file_name = request.args.get('file', default=0, type=str)
    print(read_file_name)
    sparkSessionDict[read_file_name] = spark.newSession()
    path = "/user/root/sudhindra/"+str(read_file_name)+".csv"
    df = sparkSessionDict[read_file_name].read.format("csv").load(path)
    count = df.count()
    end = time.time()
    return "Start Time = {}sec \nEnd Time = {}sec \nTime = {}sec -- \nCount = {}".format(start,end,(end - start),count)


@app.route('/')
def hello():
    return "Hello World!"

if __name__ == '__main__':
    atexit.register(exit_handler)
    app.run(host='<IP Address of the localhost>',debug=True,threaded=True)

Fair pool scheduling

<configuration  xmlns:xi="http://www.w3.org/2001/XInclude">

<allocations>
  <queue name="sparkQueue"> --New Spark Queue
    <minResources>1000 mb,0vcores</minResources>
    <maxResources>8000 mb,0vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <maxAMShare>0.1</maxAMShare>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>

  <queuePlacementPolicy>
    <rule name="specified" />
    <rule name="default" queue="sparkQueue"/>
  </queuePlacementPolicy>
</allocations>
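
The queue limits above allow 0 vcores. Because YARN containers normally request at least one vcore, a maxResources of 0 vcores may leave the queue unable to start any container; this is a guess to check, not a confirmed diagnosis. The sketch below parses resource strings in the "X mb, Y vcores" form and flags limits with zero vcores. The helper names parse_resources and zero_vcore_limits are hypothetical.

```python
import re

# Matches strings such as "8000 mb,0vcores" or "1000 mb, 2 vcores".
RESOURCE_PATTERN = re.compile(r"(\d+)\s*mb\s*,\s*(\d+)\s*vcores", re.IGNORECASE)


def parse_resources(text):
    """Return (memory_mb, vcores) parsed from a resource string."""
    match = RESOURCE_PATTERN.search(text)
    if match is None:
        raise ValueError("unrecognised resource string: %r" % text)
    return int(match.group(1)), int(match.group(2))


def zero_vcore_limits(queue_limits):
    """Return the names of limits whose vcore count is 0, sorted by name."""
    return [
        name
        for name, text in sorted(queue_limits.items())
        if parse_resources(text)[1] == 0
    ]


# Values copied from the sparkQueue definition above.
limits = {"minResources": "1000 mb,0vcores", "maxResources": "8000 mb,0vcores"}
flagged = zero_vcore_limits(limits)
```

If flagged is not empty, raising the vcore part of maxResources to a positive number would be a reasonable first experiment.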

Spark job does not run on the YARN cluster under the newly created queue

0 Answers: