Question

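I am given the URL of a Google Cloud Storage bucket. I have to:

1. Use that URL to get the list of blobs in the bucket.
2. For each blob, make some GCS API calls to get information about the blob (blob.size, blob.name, etc.).
3. For each blob, also read it, find something in its content and add that to the values obtained from the GCS API calls.
4. For each blob, write the values found in steps 2 and 3 to BigQuery.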

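I have thousands of blobs, so this needs to be done with Apache Beam (which was recommended to me).

My idea for the pipeline is something like this:

GetUrlOfBucket and make a PCollection from it

Use that PCollection to get the list of blobs as a new PCollection

Create a PCollection with the metadata of those blobs

Apply a transform that takes the PCollection of metadata dictionaries, goes into each blob, scans for a value and returns a new PCollection of dictionaries holding the metadata values plus this new value

Write this to BigQuery.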

I am having particular trouble working out how to return a dictionary as a PCollection.
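One way to picture this: in the Beam Python SDK the elements of a PCollection can be plain dicts, so a function that returns a dict and is applied with `beam.Map` produces a PCollection of dicts. Below is a minimal sketch, not a definitive implementation. The names `blob_to_row`, `add_match_count`, `run` and `read_text`, the `match_count` key and the `event` pattern are all made up for illustration, and the blobs are assumed to expose `name` and `size` attributes.

```python
import re
from types import SimpleNamespace


def blob_to_row(blob):
    """Turn a blob-like object into a plain dict (one future BigQuery row)."""
    return {"blob_name": blob.name, "size": int(blob.size)}


def add_match_count(row, text, pattern=r"event"):
    """Return a new dict: the metadata row plus a value found in the content."""
    enriched = dict(row)  # copy, so the input element is never mutated
    enriched["match_count"] = len(re.findall(pattern, text))
    return enriched


def run(blobs, read_text):
    """Wire both steps together; read_text(name) -> str is a hypothetical reader."""
    import apache_beam as beam  # imported lazily; the helpers above need no Beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateBlobs" >> beam.Create(blobs)
            | "ToRow" >> beam.Map(blob_to_row)  # PCollection of dicts
            | "Enrich" >> beam.Map(
                lambda row: add_match_count(row, read_text(row["blob_name"])))
            | "Print" >> beam.Map(print)
        )
```

The two helpers can be tried without Beam, for example `blob_to_row(SimpleNamespace(name="a.log", size="12"))`. Returning a copy in `add_match_count` matters because Beam expects transforms not to modify their input elements.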

[+] What I have read:

https://beam.apache.org/documentation/programming-guide/#composite-transforms

https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366

Any suggestions are greatly appreciated, especially on how to take that bucket name and return a PCollection of blobs.
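For the listing step, a rough sketch could start from a one-element PCollection holding the URL and expand it with `beam.FlatMap`. This assumes the URL has the form `gs://bucket/optional/prefix` and that the `google-cloud-storage` client library and credentials are available on the workers. The function names `parse_gcs_url`, `list_blob_rows` and `build_listing` are hypothetical.

```python
from urllib.parse import urlparse


def parse_gcs_url(url):
    """Split 'gs://bucket/some/prefix' into (bucket_name, prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_blob_rows(url):
    """Yield one metadata dict per blob under the URL (needs GCS credentials)."""
    from google.cloud import storage  # lazy import: only needed where this runs

    bucket_name, prefix = parse_gcs_url(url)
    client = storage.Client()
    # An empty prefix is passed as None so the whole bucket is listed.
    for blob in client.list_blobs(bucket_name, prefix=prefix or None):
        yield {"bucket": bucket_name, "blob_name": blob.name, "size": blob.size}


def build_listing(pipeline, url):
    """Return a PCollection of blob metadata dicts for one bucket URL."""
    import apache_beam as beam

    return (
        pipeline
        | "BucketUrl" >> beam.Create([url])
        | "ListBlobs" >> beam.FlatMap(list_blob_rows)
    )
```

Emitting small dicts rather than the `Blob` objects themselves is a deliberate choice in this sketch: the elements then carry no client connection and are cheap to serialize between workers.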

Answer 1

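I solved this by reading more about Apache Beam and working out that I had to use ParDo to split the work across my resources. Inside the ParDo I call my DoFn, which receives each element, does all the processing it needs, and yields a dict. See this post: Apache Beam: How To Simultaneously Create Many PCollections That Undergo Same PTransform?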

    import apache_beam as beam


    class ExtractMetadata(beam.DoFn):
        def process(self, element):
            """
            Takes in a blob, reads its values and yields a dictionary of values
            """
            # Custom blob metadata may be absent, so guard the lookup.
            metadata = element.metadata
            if metadata is not None:
                event_count = int(metadata['count'])
            else:
                event_count = None

            # The determine_* helpers are methods of this class (not shown here).
            event_type = self.determine_event_type(element.id)
            cluster = self.determine_cluster(element.id)
            customer = self.determine_customer(element)
            # date = datetime.strptime(element.time_created, '%a, %d %b %Y %H:%M:%S')
            # date = date.isoformat()
            dic = {
                'blob_name': element.name,
                'event_path': element.path,
                'size': int(element.size),
                'time_of_creation': element.time_created.isoformat(),
                'event_count': event_count,
                'event_type': event_type,
                'cluster': cluster,
                'customer': customer,
            }
            # Each yielded dict becomes one element of the output PCollection.
            yield dic
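To finish the pipeline, the dicts yielded by the DoFn can be handed to `beam.io.WriteToBigQuery`, which accepts a schema as a comma-separated `name:TYPE` string. The sketch below is only an illustration: the column types in `FIELDS` are guesses based on the dict keys above, and `schema_string` and `write_rows` are hypothetical helper names.

```python
# Column names follow the dict keys above; the types are assumptions.
FIELDS = [
    ("blob_name", "STRING"),
    ("event_path", "STRING"),
    ("size", "INTEGER"),
    ("time_of_creation", "TIMESTAMP"),
    ("event_count", "INTEGER"),
    ("event_type", "STRING"),
    ("cluster", "STRING"),
    ("customer", "STRING"),
]


def schema_string(fields):
    """Build the 'name:TYPE,name:TYPE' schema form from (name, type) pairs."""
    return ",".join(f"{name}:{ftype}" for name, ftype in fields)


def write_rows(rows, table):
    """Append a PCollection of dicts to a table given as 'project:dataset.table'."""
    import apache_beam as beam  # lazy import keeps schema_string usable alone

    return rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
        table,
        schema=schema_string(FIELDS),
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```

With a PCollection of blobs named `blobs`, usage would look roughly like `rows = blobs | beam.ParDo(ExtractMetadata())` followed by `write_rows(rows, "my-project:my_dataset.blob_metadata")`, where the table name is a placeholder.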
