Apache Beam Python ReadFromText Regex

Time: 2018-03-07 14:57:28

Tags: python apache-beam

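I have a situation where I want to read data from GCS for two days. My folder structure is sensors/<date>/<hash>/x.csv.gz, and I want to be able to read the files for '20171104' and '20171105'. Using the regex sensors/[20171104,20171105]/<hash>/* does not work. Does anyone know the best way to handle this with the beam.io.ReadFromText function?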
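A likely reason the bracket pattern fails is that file patterns follow glob rules rather than full regex rules, and in a glob `[...]` is a character class that matches a single character. The sketch below illustrates this with Python's standard `fnmatch` module, used here only as a stand-in for Beam's own matching (an assumption); the `abc123` folder name is a hypothetical substitute for `<hash>`.

```python
from fnmatch import fnmatchcase

# Hypothetical object names following the layout in the question;
# "abc123" stands in for the <hash> folder.
paths = [
    "sensors/20171104/abc123/x.csv.gz",
    "sensors/20171105/abc123/x.csv.gz",
]

# In glob syntax, [...] is a character class: it matches exactly ONE character
# from the set {0, 1, 2, 4, 5, 7, ","}, not one of two alternative strings.
bracket_pattern = "sensors/[20171104,20171105]/*/*"
bracket_matches = [p for p in paths if fnmatchcase(p, bracket_pattern)]

# One explicit pattern per day does match.
day_patterns = ["sensors/%s/*/*" % day for day in ("20171104", "20171105")]
per_day_matches = [
    p for p in paths if any(fnmatchcase(p, pat) for pat in day_patterns)
]

print(bracket_matches)
print(per_day_matches)
```

With these sample paths the bracket pattern matches nothing, while the per-day patterns match both files, which is why reading each day with its own pattern is a workable approach.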

1 Answer:

Answer 0 (score: 0)

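I figured out how to read the data for the intended days without using wildcards for the date, by writing a Python function instead. The idea is to build a list containing one read operation per day, then flatten that list and use the result as the input to the pipeline.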

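    from datetime import datetime, timedelta
    import logging

    import apache_beam as beam


    def read_files(pipeline, intended_day):
        """Return one PCollection per day: intended_day and the day before it."""
        collections = []
        previous_day = (datetime.strptime(intended_day, '%Y%m%d') - timedelta(days=1)).strftime('%Y%m%d')

        days = [intended_day, previous_day]
        # <hash> is the placeholder from the question's folder layout.
        path = "gs://sensors/{}/<hash>/*"
        for day in days:
            try:
                file_name = path.format(day)
                # A unique label per day keeps the transform names distinct.
                collection = pipeline | ('Read Past for %s' % day) >> beam.io.ReadFromText(file_name)
                collections.append(collection)
            except IOError:
                # Skip a day whose files cannot be read and carry on with the rest.
                logging.error("Failed to read for day %s", day)

        return collections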

Then call the function in your pipeline:

    p = beam.Pipeline(runner=runner, argv=argv)
    intended_day = "20170810"
    pcollections = read_files(p, intended_day)
    result = ((pcollections | "Flatten sensor" >> beam.Flatten())
               | .....
             )
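If more than two days are needed, the date arithmetic can be pulled out into a small helper that returns one file pattern per day. This is only a sketch: the name `day_patterns`, the `days_back` parameter, and the default template (which uses `*` in place of the `<hash>` folder) are assumptions for illustration and are not part of the original answer.

```python
from datetime import datetime, timedelta


def day_patterns(intended_day, days_back=1, template="gs://sensors/{}/*/*"):
    """Return one file pattern per day, from intended_day back days_back days.

    The template is a hypothetical default; adjust it to the real bucket layout.
    """
    end = datetime.strptime(intended_day, "%Y%m%d")
    days = [
        (end - timedelta(days=n)).strftime("%Y%m%d")
        for n in range(days_back + 1)
    ]
    return [template.format(day) for day in days]


patterns = day_patterns("20171105")
print(patterns)
```

Each returned pattern could then be passed to `beam.io.ReadFromText` inside the loop of `read_files`, keeping the per-day label so the transform names stay unique.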