Collect logs separately for each iteration in Python

Time: 2017-07-28 16:45:16

Tags: python linux bash pyspark

I have a pyspark script, shown below. In this script, I loop over an input file of table names and run the code for each one.

Now I want to collect the logs separately for each iteration of the mysql_spark function.

For example:

input file

table1
table2
table3

Right now, when I execute the pyspark script, the logs for all three tables are written to a single file.

What I want is 3 separate log files, one for each table.

Pyspark script:

#!/usr/bin/env python
import sys
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

#Condition to specify exact number of arguments in the spark-submit command line
if len(sys.argv) != 5:
    print "Invalid number of args......"
    print "Usage: spark-submit import.py Arguments"
    exit()
args_file = sys.argv[1]
hivedb = sys.argv[2]
mysqldb=sys.argv[3]
mysqltable=sys.argv[4]

def mysql_spark(table, hivedb, mysqldb, mysqltable):

    print "*********************************************************table = {} ***************************".format(table)

    df = sqlContext.table("{}.{}".format(mysqldb, mysqltable))

    df.registerTempTable("mytempTable")

    sqlContext.sql("create table {}.{} as select * from mytempTable".format(hivedb,table))

input = sc.textFile('/user/XXXXXXXX/mysql_spark/%s' %args_file).collect()

for table in input:
    mysql_spark(table, hivedb, mysqldb, mysqltable)

sc.stop()

Shell script that calls the pyspark script:

#!/bin/bash

source /home/$USER/mysql_spark/source.sh
[ $# -ne 1 ] && { echo "Usage : $0 table ";exit 1; }

args_file=$1

TIMESTAMP=`date "+%Y-%m-%d"`
touch /home/$USER/logs/${TIMESTAMP}.success_log
touch /home/$USER/logs/${TIMESTAMP}.fail_log
success_logs=/home/$USER/logs/${TIMESTAMP}.success_log
failed_logs=/home/$USER/logs/${TIMESTAMP}.fail_log

#Function to get the status of the job creation
function log_status
{
       status=$1
       message=$2
       if [ "$status" -ne 0 ]; then
                echo "`date +\"%Y-%m-%d %H:%M:%S\"` [ERROR] $message [Status] $status : failed" | tee -a "${failed_logs}"
                exit 1
                else
                    echo "`date +\"%Y-%m-%d %H:%M:%S\"` [INFO] $message [Status] $status : success" | tee -a "${success_logs}"
                fi
}

spark-submit --name "${args_file}" --master "yarn-client" /home/$USER/mysql_spark/mysql_spark.py ${args_file} ${hivedb} ${mysqldb} ${mysqltable} 

g_STATUS=$?
log_status $g_STATUS "Spark job ${args_file} Execution"

Sample log file:

Connection to spark
***************************table = table 1 ********************************
created dataframe
created table
delete temp directory
***************************table = table 2 ********************************
created dataframe
created table
delete temp directory
***************************table = table 3 ********************************
created dataframe
created table
delete temp directory

Expected output

table1.logfile

Connection to spark
***************************table = table 1 ********************************
created dataframe
created table
delete temp directory   

table2.logfile

***************************table = table 2 ********************************
created dataframe
created table
delete temp directory   

table3.logfile

***************************table = table 3 ********************************
created dataframe
created table
delete temp directory
shutdown sparkContext   

How can I achieve this?

Is it possible to do this?

1 Answer:

Answer 0: (score: 1)

You can create a new file for each iteration and write the data to it.

Here is a simple example:

lis =['table1','table2']

for table in lis:
    logfile = open(str(table)+".logfile",'w')
    logfile.write(str(table))
    logfile.close()
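
As a side note, the same example can be written with a with block (a variant, not a requirement), which closes the file automatically even if a write fails:

lis = ['table1', 'table2']

for table in lis:
    # the with statement closes the file automatically, even on error
    with open(str(table) + ".logfile", 'w') as logfile:
        logfile.write(str(table))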

In your code, if you apply the same idea and pass the file object into the mysql_spark function, it should work for every iteration.

for table in input:
    logfile = open(str(table)+".logfile",'w')
    mysql_spark(table, hivedb, mysqldb, mysqltable, logfile)
    logfile.close()
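
For reference, here is a minimal sketch (an assumption, not tested against your cluster) of how mysql_spark could accept the extra logfile argument and write the per-table messages from your sample log into it instead of printing them:

def mysql_spark(table, hivedb, mysqldb, mysqltable, logfile):
    # write the banner for this table into its own log file instead of stdout
    logfile.write("table = {}\n".format(table))

    df = sqlContext.table("{}.{}".format(mysqldb, mysqltable))
    logfile.write("created dataframe\n")

    df.registerTempTable("mytempTable")

    sqlContext.sql("create table {}.{} as select * from mytempTable".format(hivedb, table))
    logfile.write("created table\n")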