Question

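I am using Apache Beam on Google Cloud Dataflow.

My pipeline reads data from BigQuery, but the query depends on execution parameters. I need to be able to run the pipeline for a single point (longitude, latitude) as well as for several points.

With a single point the solution is simple: I can pass the query as a ValueProvider.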

select * 
from UserProfile 
where id_ in ( select distinct userid 
               from   locations 
               where  ST_DWITHIN(ST_GeogPoint(longitude, latitude),
                                 ST_GeogPoint(10.9765,50.4322),
                                 300)
             )

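The problem arises when I have more than one point to run the query for. I tried applying a BigQuery read for each point and merging the results into a single PCollection, but I don't know how to pass the points to the pipeline and build it dynamically.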

Answer 1

One approach is to first put the geographic points into a table (say, my_points_table) and then pick them up in the subquery:

select * from UserProfile where id_ in 
   (
     select distinct userid from locations l 
     left outer join my_points_table t on 1=1
     where 
      ST_DWITHIN(
        ST_GeogPoint(l.longitude, l.latitude),
        ST_GeogPoint(t.longitude, t.latitude),
      300)
   )

Answer 2

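If the number of points is not too large (I would say fewer than a thousand), a simple way to run this query is to supply a string containing a WKT description of the point set: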

select * 
from UserProfile 
where id_ in ( 
    select distinct userid 
    from   locations 
    where ST_DWITHIN(ST_GeogPoint(longitude, latitude),
                     ST_GeogFromText("MULTIPOINT((10.9765 50.4322), (10 50))"),
                     300)
    )

The WKT string should be easy to build in your code.
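As a rough illustration of building that string, here is a minimal Python sketch. The helper names `build_multipoint_wkt` and `build_query` are hypothetical, and the table and column names are simply the ones from the query above. Coordinates are coerced with `float()` so that only numbers end up in the SQL text. This is one possible way to do it under those assumptions, not a definitive implementation.

```python
from typing import Iterable, Tuple


def build_multipoint_wkt(points: Iterable[Tuple[float, float]]) -> str:
    """Return a WKT MULTIPOINT string for (longitude, latitude) pairs."""
    # float() rejects non-numeric input, so arbitrary text cannot leak into the query.
    parts = [f"({float(lon)} {float(lat)})" for lon, lat in points]
    if not parts:
        raise ValueError("at least one point is required")
    return "MULTIPOINT(" + ", ".join(parts) + ")"


def build_query(points: Iterable[Tuple[float, float]], radius_m: int = 300) -> str:
    """Build the query text from the answer above for the given points (hypothetical helper)."""
    wkt = build_multipoint_wkt(points)
    return (
        "select * from UserProfile where id_ in ("
        " select distinct userid from locations"
        " where ST_DWITHIN(ST_GeogPoint(longitude, latitude),"
        f" ST_GeogFromText('{wkt}'), {int(radius_m)}))"
    )


if __name__ == "__main__":
    print(build_query([(10.9765, 50.4322), (10, 50)]))
```

The resulting string can then be handed to the pipeline the same way as the single-point query, for example through the ValueProvider mentioned in the question.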

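If the number of points of interest is larger, I would use a table of points and JOIN the locations table with the points-of-interest table: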

select * 
from UserProfile 
where id_ in ( 
    select distinct userid 
    from   locations as l, interesting_points as p
    where ST_DWITHIN(ST_GeogPoint(l.longitude, l.latitude),
                     p.point,
                     300)
    )
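To populate such a points table, the coordinates first have to be turned into rows. The sketch below is only an assumption about how that might look: `points_to_rows` is a hypothetical helper, the column name `point` is taken from the query above, and it assumes the column is of type GEOGRAPHY and is loaded from WKT text. How the rows are actually written to BigQuery (load job, Beam sink, and so on) is left open.

```python
from typing import Dict, Iterable, List, Tuple


def points_to_rows(points: Iterable[Tuple[float, float]]) -> List[Dict[str, str]]:
    """Turn (longitude, latitude) pairs into row dicts with a WKT `point` value."""
    # Assumption: the target column `point` is GEOGRAPHY and accepts WKT strings on load.
    return [{"point": f"POINT({float(lon)} {float(lat)})"} for lon, lat in points]


if __name__ == "__main__":
    for row in points_to_rows([(10.9765, 50.4322), (10, 50)]):
        print(row)
```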

How to implement dynamic BigQueryIO input
