Saving Luigi HiveQueryTask results to a file

Asked: 2018-03-26 19:11:57

Tags: python hadoop luigi

I am very inexperienced with Luigi and Python, but I am trying to work out why the results of a Hive query are not being saved to the specified output file. I believe the relevant parts are the query executed via the query() method and the save location defined in output():

import luigi
from luigi.contrib.hive import HiveQueryTask

class deliverableweekValues(HiveQueryTask):
    tablename = luigi.Parameter(default='basetable')
    database = luigi.Parameter(default='base_database')
    #needs to return something as output, for subsequent splitting of tasks
    def output(self):
        return luigi.LocalTarget('weeks_for_metrics.txt')
    #this determines what weeks are available to run metrics on
    def query(self):
        tmpl = """
SELECT DISTINCT week FROM {0}.{1} ORDER BY week
"""
        qry = tmpl.format(self.database, self.tablename)
        print(qry)
        return qry

#run only the above function via line below, ***WORKS***
luigi.run(['deliverableweekValues','--local-scheduler'])

This is not an error; it simply never saves to the file named weeks_for_metrics.txt. Luigi output:

DEBUG: Checking if deliverableweekValues(tablename=basetable, database=base_database) is complete
/usr/lib/python2.7/site-packages/luigi/parameter.py:259: UserWarning: Parameter None is not of type string.
  warnings.warn("Parameter {0} is not of type string.".format(str(x)))
INFO: Informed scheduler that task   deliverableweekValues_base_database_basetable_96cd7fa3f7   has status   PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1

SELECT DISTINCT week FROM base_database.basetable ORDER BY week

INFO: ['hive', '-f', '/tmp/tmpEaQmC4', '--hiveconf', "mapred.job.name='deliverableweekValues_base_database_basetable_96cd7fa3f7'"]
INFO: hive -f /tmp/tmpEaQmC4 --hiveconf mapred.job.name='deliverableweekValues_base_database_basetable_96cd7fa3f7'
INFO: log4j:WARN No such property [maxFileSize] in org.apache.log4j.DailyRollingFileAppender.
INFO: Logging initialized using configuration in file:/etc/hive/2.6.1.0-129/0/hive-log4j.properties
INFO: Query ID = jdavidson_20180326153035_c2fdf713-037f-4c79-93f6-0328fe57d208
INFO: Total jobs = 1
INFO: Launching Job 1 out of 1
INFO: Status: Running (Executing on YARN cluster with App id application_1520857775863_57954)
INFO: OK
INFO: Time taken: 30.451 seconds, Fetched: 122 row(s)
DEBUG: 1 running tasks, waiting for next task to finish
INFO: Informed scheduler that task   deliverableweekValues_base_database_basetable_96cd7fa3f7   has status   DONE
DEBUG: Asking scheduler for work...
DEBUG: Done
DEBUG: There are no more tasks to run at this time
INFO:
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 ran successfully:
    - 1 deliverableweekValues(tablename=basetable, database=base_database)

This progress looks :) because there were no failed tasks or missing external dependencies

===== Luigi Execution Summary =====

True

However, even though it completes without errors and fetches the correct 122 rows, the file does not exist in that directory:

ls week*
ls: cannot access week*: No such file or directory

A working file-writing example from the luigi documentation

I have successfully written to a file this way with the following example code, so I believe I am missing something fairly basic:

import luigi

class GenerateWords(luigi.Task):
    def output(self):
        return luigi.LocalTarget('words.txt')
    def run(self):
        # write a dummy list of words to output file
        words = ['apple','banana','grapefruit']
        with self.output().open('w') as f:
            for word in words:
                f.write('{word}\n'.format(word=word))

class CountLetters(luigi.Task):
    def requires(self):
        return GenerateWords()
    def output(self):
        return luigi.LocalTarget('letter_counts.txt')
    def run(self):
        # read in the input file as a list of words
        with self.input().open('r') as infile:
            words = infile.read().splitlines()
        # write each word to the output file with its corresponding letter count
        with self.output().open('w') as outfile:
            for word in words:
                outfile.write(
                    '{word} | {letter_count}\n'.format(
                        word=word,
                        letter_count=len(word)
                    )
                )

#run via
luigi.run(['CountLetters','--local-scheduler'])

This creates the files:

ls w*
words.txt
ls letter*
letter_counts.txt
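So the working examples create their files because run() opens the target and writes to it, while (as far as I can tell) HiveQueryTask only executes the query and lets the hive CLI print the rows to stdout. At the stdlib level, the missing step seems to be capturing a command's stdout and saving it yourself — a minimal sketch, with `echo` as a hypothetical stand-in for something like `['hive', '-e', '<query>']`:

```python
import subprocess

def save_command_output(cmd, path):
    # Run a command and save its stdout to a file. This is the step
    # that never happens for me with HiveQueryTask: the fetched rows
    # are never redirected into the target.
    out = subprocess.check_output(cmd)
    with open(path, 'wb') as f:
        f.write(out)

# 'echo' stands in for the hive CLI purely for illustration
save_command_output(['echo', 'hello'], 'demo_output.txt')
```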

0 Answers