Question

Environment: Spark 1.6, Scala

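Hello,
I need to run two processes in parallel. The first receives data; the second transforms it and saves it to a Hive table. I want to repeat the first process at 1-minute intervals and the second at 2-minute intervals.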

==========First Process=== executes once per minute=============    
val DFService = hivecontext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://xx.x.x.xx:xxxx;database=myDB")
  .option("dbtable", "(select Service_ID, ServiceIdentifier from myTable) tmp")
  .option("user", "userName")
  .option("password", "myPassword")
  .load()
DFService.registerTempTable("memTBLService")

DFService.write.mode("append").saveAsTable("hiveTable")


=============Second Process === executes once every 2 minutes =========
val DF2 = hivecontext.sql("select * from hiveTable")
val data = DF2.select(DF2("Service_ID")).distinct
data.show()

How can I run these two processes in parallel, each at its own interval, in Scala?
Thanks, Hossein

Answer 1

Write two separate Spark applications.

Then use cron to run each application on the schedule you want. Alternatively, you can use Apache Airflow to schedule the Spark applications.

For how to use cron with Spark, see this question: How to schedule my Apache Spark application to run everyday at 00.30 AM(night) in IBM Bluemix?
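
As a rough sketch of that split, the two processes from the question could be packaged as two entry points in one jar and submitted separately. The object names `IngestJob` and `TransformJob`, the constants `TargetTable` and `SourceTable`, the jar path, and the crontab lines in the comments are all hypothetical placeholders, not something from the question. The code assumes the Spark 1.6 APIs (`HiveContext`, `DataFrameWriter`) on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

// Hypothetical crontab entries (jar path and class names are placeholders):
//   * * * * *    spark-submit --class IngestJob    /path/to/jobs.jar
//   */2 * * * *  spark-submit --class TransformJob /path/to/jobs.jar

// First process: read from SQL Server over JDBC and append to the Hive table.
object IngestJob {
  val TargetTable = "hiveTable"

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("IngestJob"))
    val hiveContext = new HiveContext(sc)
    try {
      val services = hiveContext.read
        .format("jdbc")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("url", "jdbc:sqlserver://xx.x.x.xx:xxxx;database=myDB")
        .option("dbtable", "(select Service_ID, ServiceIdentifier from myTable) tmp")
        .option("user", "userName")
        .option("password", "myPassword")
        .load()
      services.write.mode(SaveMode.Append).saveAsTable(TargetTable)
    } finally {
      sc.stop() // release cluster resources before the next scheduled run
    }
  }
}

// Second process: read the Hive table and show the distinct service IDs.
object TransformJob {
  val SourceTable = "hiveTable"

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TransformJob"))
    val hiveContext = new HiveContext(sc)
    try {
      val all = hiveContext.sql(s"select * from $SourceTable")
      all.select(all("Service_ID")).distinct.show()
    } finally {
      sc.stop()
    }
  }
}
```

With this layout each run starts and stops its own Spark application, so every invocation pays the Spark startup cost. At a 1-minute interval that overhead may matter, and it is worth measuring before settling on the schedule.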

How to run 2 processes in parallel in Scala and Spark?
