Google Dataflow: exclude one PCollection<String> from another PCollection<String>

Time: 2018-07-27 20:01:20

Tags: google-cloud-dataflow apache-beam

I have two PCollections like the following:

P1 = ['H','E','L','L','O','W','O','R','L','D']

P2 = ['W','E','L','C','O','M','E']

I want to exclude from the first collection every element that also exists in the second collection, so that I get the result below:

Result = ['H','R','D']

What is the fastest, most optimized way to do this?

1 answer:

Answer 0 (score: 1)

Use CombinePerKey: https://beam.apache.org/documentation/programming-guide/#combine

Python:https://beam.apache.org/documentation/sdks/pydoc/2.5.0/apache_beam.transforms.core.html?highlight=combineperkey#apache_beam.transforms.core.CombinePerKey

Java:https://beam.apache.org/documentation/sdks/javadoc/2.5.0/org/apache/beam/sdk/transforms/Combine.PerKey.html

  1. Convert P1 and P2 to tuples like this:

Code:

P1 = [('H', 'P1'), ('E', 'P1'), ('L', 'P1'), ('L', 'P1'), ('O', 'P1'), ('W', 'P1'), ('O', 'P1'), ('R', 'P1'), ('L', 'P1'), ('D', 'P1')]
P2 = [('W', 'P2'), ('E', 'P2'), ('L', 'P2'), ('C', 'P2'), ('O', 'P2'), ('M', 'P2'), ('E', 'P2')]

  2. Flatten the two PCollections together.

  3. Pass the flattened PCollection into CombinePerKey with a CombineFn that marks whether a string is contained in P1, in P2, or in both:

Code:

import apache_beam


class IsInBoth(apache_beam.CombineFn):
    # The accumulator records which source collections contained the key.
    def _add_inputs(self, elements, accumulator=None):
        accumulator = accumulator or self.create_accumulator()
        for obj in elements:
            if obj == 'P1':
                accumulator['P1'] = True
            if obj == 'P2':
                accumulator['P2'] = True
        return accumulator

    def create_accumulator(self):
        return {'P1': False, 'P2': False}

    def add_input(self, accumulator, element, *args, **kwargs):
        return self._add_inputs(elements=[element], accumulator=accumulator)

    def add_inputs(self, accumulator, elements, *args, **kwargs):
        return self._add_inputs(elements=elements, accumulator=accumulator)

    def merge_accumulators(self, accumulators, *args, **kwargs):
        # Materialize first: the accumulators may arrive as a one-shot iterable.
        accumulators = list(accumulators)
        return {
            'P1': any(i['P1'] for i in accumulators),
            'P2': any(i['P2'] for i in accumulators)}

    def extract_output(self, accumulator, *args, **kwargs):
        return accumulator

  4. Filter the results of CombinePerKey on these flags. To get the result asked for in the question, keep the keys where 'P1' is True and 'P2' is False.
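The tag, flatten, combine-per-key, and filter steps can be checked without a Beam runner. The following is a minimal sketch in plain Python that emulates them on the sample data from the question. The helper name `exclude` is hypothetical and not part of the Beam API. It only illustrates the logic, not how a distributed runner executes it.

```python
from collections import defaultdict

P1 = ['H', 'E', 'L', 'L', 'O', 'W', 'O', 'R', 'L', 'D']
P2 = ['W', 'E', 'L', 'C', 'O', 'M', 'E']


def exclude(first, second):
    """Return the distinct elements of `first` that never appear in `second`.

    Hypothetical helper: a local emulation of the pipeline steps, not Beam code.
    """
    # Tag each element with its source collection, then flatten into one list.
    tagged = [(x, 'P1') for x in first] + [(x, 'P2') for x in second]

    # Per key, record which sources contributed (the role of the CombineFn).
    flags = defaultdict(lambda: {'P1': False, 'P2': False})
    for key, tag in tagged:
        flags[key][tag] = True

    # Keep only the keys seen in the first collection and not in the second.
    return [key for key, seen in flags.items() if seen['P1'] and not seen['P2']]


result = exclude(P1, P2)
print(result)  # ['H', 'R', 'D']
```

Because the combine step works per key, each surviving element appears once in the output, which matches the expected result in the question. In an actual Beam pipeline the output order of a PCollection is not guaranteed, so the ordering shown here is a property of this local emulation only.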