Thanks in advance!
[+] The problem:
I have a lot of files on Google Cloud Storage, and for each of those files I need to load it, run the same transform, and write the result.
I have been using Python 2.7 and the Google Cloud SDK. Run linearly, this takes hours, so I was advised to use Apache Beam / Dataflow to process the files in parallel.
[+] What I can do:
I can read from a single file, apply a PTransform, and write to another file:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def loadMyFile(pipeline, path):
    return pipeline | "LOAD" >> beam.io.ReadFromText(path)

def myFilter(request):
    return request

# `path` and `google_cloud_options` are defined elsewhere in the original post.
with beam.Pipeline(options=PipelineOptions()) as p:
    data = loadMyFile(p, path)
    output = data | "FILTER" >> beam.Filter(myFilter)
    output | "WRITE" >> beam.io.WriteToText(google_cloud_options.staging_location)
[+] What I want to do:
How can I load many files at once, run the same transform on all of them in parallel, and then write the results to BigQuery in parallel?
[Diagram of what I wish to perform]
[+] What I have read:
https://beam.apache.org/documentation/programming-guide/
http://enakai00.hatenablog.com/entry/2016/12/09/104913
Again, many thanks!
Answer 0 (score: 0)
textio accepts a file_pattern.
From the Python SDK:
file_pattern (str) – the file path to read from, as a local file path or a GCS gs:// path. The path can contain glob characters.
For example, suppose you have a bunch of *.txt files stored under gs://my-bucket/files/; you could say:
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "LOAD" >> beam.io.textio.ReadFromText(file_pattern="gs://my-bucket/files/*.txt")
     | "FILTER" >> beam.Filter(myFilter)
     | "WRITE" >> beam.io.textio.WriteToText(output_location))
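Since the question asks about writing to BigQuery rather than to text files, here is a minimal sketch of that variant of the same pipeline; the table reference, the one-column schema, and the TO_ROW mapping are placeholder assumptions, not anything from the original answer:
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | "LOAD" >> beam.io.textio.ReadFromText(file_pattern="gs://my-bucket/files/*.txt")
     | "FILTER" >> beam.Filter(myFilter)
     # WriteToBigQuery expects dict rows keyed by column name.
     | "TO_ROW" >> beam.Map(lambda line: {'line': line})
     | "WRITE" >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.my_table',  # placeholder table reference
           schema='line:STRING',              # placeholder one-column schema
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))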
If you have multiple PCollections of the same type, you can also Flatten them into a single one:
merged = (
    (pcoll1, pcoll2, pcoll3)
    # A list of tuples can be "piped" directly into a Flatten transform.
    | beam.Flatten())
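As a usage sketch (the gs:// paths here are made up), each read runs in parallel, and the shared transform only has to be defined once on the merged collection:
with beam.Pipeline(options=PipelineOptions()) as p:
    pcoll1 = p | "LOAD1" >> beam.io.ReadFromText("gs://my-bucket/a.txt")
    pcoll2 = p | "LOAD2" >> beam.io.ReadFromText("gs://my-bucket/b.txt")
    # Flatten merges the two reads into one PCollection.
    merged = (pcoll1, pcoll2) | beam.Flatten()
    merged | "FILTER" >> beam.Filter(myFilter)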
Answer 1 (score: 0)
OK, I solved this the following way:
1) Get the name of a bucket from somewhere (first PCollection)
2) Get the list of blobs from that bucket (second PCollection)
3) FlatMap to get the blobs individually out of that list (third PCollection)
4) ParDo that fetches the metadata
5) Write to BigQuery
My pipeline looks like this:
with beam.Pipeline(options=options) as pipe:
    bucket = pipe | "GetBucketName" >> beam.io.ReadFromText('gs://example_bucket_eraseme/bucketName.txt')
    listOfBlobs = bucket | "GetListOfBlobs" >> beam.ParDo(ExtractBlobs())
    blob = listOfBlobs | "SplitBlobsIndividually" >> beam.FlatMap(lambda x: x)
    dic = blob | "GetMetaData" >> beam.ParDo(ExtractMetadata())
    dic | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
        # (the WriteToBigQuery arguments were cut off in the original post)
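The answer doesn't show ExtractBlobs or ExtractMetadata. A rough sketch of what such DoFns might look like, assuming the google-cloud-storage client and a made-up three-field metadata row (this is not the poster's actual code):
import apache_beam as beam
from google.cloud import storage

class ExtractBlobs(beam.DoFn):
    # Hypothetical: turns a bucket name into the list of blobs it contains.
    def process(self, bucket_name):
        client = storage.Client()
        yield list(client.get_bucket(bucket_name.strip()).list_blobs())

class ExtractMetadata(beam.DoFn):
    # Hypothetical: maps a blob to a dict row; the keys are placeholders
    # and would have to match the BigQuery table schema.
    def process(self, blob):
        yield {
            'name': blob.name,
            'size': blob.size,
            'content_type': blob.content_type,
        }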