Question

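I have a shell script that connects to beeline, collects data from 100 tables, and dumps that data into a single table. The script calls beeline 5 times per table, so the process takes too long to complete, and I have been advised to use the Spark shell instead of a shell script.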

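I tried fetching the data with the shell script, but it proved time-consuming (about 5 hours per run), so I was asked to use Spark to fetch the data instead. The code needs to run every day.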

I used the following code:

beeline -u "connection details" --showHeader=false --outputformat=tsv2 --hivevar dbname=${dbname} -e "show partitions ${table_name};" > ${tableProcess}.tmp


for line in $(tail -1 ${tableProcess}.tmp)   # tail -1 gets the latest partition
do
    # part_year:  I have found a way to extract the year from $line
    # part_month: same
    # part_day:   same
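
For illustration only, here is a minimal sketch of one way the three fields could be extracted. It assumes the partition line looks like `year=2020/month=01/day=15`; the actual format is not shown in this question, and `parse_partition` is a hypothetical helper name, not something from my script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a Hive partition spec such as
# "year=2020/month=01/day=15" into part_year, part_month and part_day.
# The "key=value/key=value" layout is an assumption about SHOW PARTITIONS output.
parse_partition() {
  local spec="$1" kv key value
  part_year="" part_month="" part_day=""
  local IFS='/'                 # split the spec on "/" only
  for kv in $spec; do
    key="${kv%%=*}"             # text before the first "="
    value="${kv#*=}"            # text after the first "="
    case "$key" in
      year)  part_year="$value" ;;
      month) part_month="$value" ;;
      day)   part_day="$value" ;;
    esac
  done
}

parse_partition "year=2020/month=01/day=15"
echo "${part_year}-${part_month}-${part_day}"   # prints 2020-01-15
```

Because this uses only shell parameter expansion, it adds no extra process per table.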

Then, after all of the above, I used beeline to collect data from the 100 tables one by one:

beeline -u "connection details" -e "select count(*) from ${table_process} where year=${part_year} and month=${part_month} and day=${part_day};" > ${tableProcess}.count

I also call beeline -u 4 more times to fetch other details using the partition date.

After collecting the various pieces of information:

I used printf to append "${dbname},${table_name},<value from ${tableProcess}.count>" and more fields to metadata.txt.

This text file is saved to a Hadoop location, and I created an external table on top of it.

Now I need to bring Spark into the code above. Please help me convert it into complete Spark code so that the processing time is reduced.
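
To make the goal concrete, the sketch below shows the direction I have in mind: instead of one beeline call per table, build a single `UNION ALL` query covering every table and submit it once. This is only a sketch under assumptions. `build_count_query`, the table names `orders` and `customers`, and the target table `mydb.metadata` are all made up for illustration, and the commented-out `spark-sql -e` step assumes a Spark installation with Hive support. Whether the partition values need quoting depends on the column types, which are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build ONE query that counts the rows in the chosen
# partition of every table, rather than launching one client per table.
# Reads lines of "table_name year month day" from stdin.
build_count_query() {
  local dbname="$1" table year month day sep="" query=""
  while read -r table year month day; do
    [ -z "$table" ] && continue          # skip blank lines
    query+="${sep}SELECT '${dbname}' AS dbname, '${table}' AS table_name, COUNT(*) AS row_count FROM ${dbname}.${table} WHERE year=${year} AND month=${month} AND day=${day}"
    sep=$'\nUNION ALL\n'                 # separator between per-table SELECTs
  done
  printf '%s\n' "$query"
}

# Example with two made-up tables; the real list would come from the
# SHOW PARTITIONS step for each of the 100 tables.
query="$(build_count_query mydb <<'EOF'
orders 2020 1 15
customers 2020 1 14
EOF
)"
echo "$query"

# Assumed submission step, left commented out because it needs a cluster:
# spark-sql -e "INSERT INTO TABLE mydb.metadata ${query}"
```

The idea is that one submission pays the client and session startup cost once rather than several hundred times; I have not measured how much of the 5 hours this would actually save.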

Framework change: converting a shell script to Spark code

0 answers: