Question

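I have two public datasets, 1) bicycle trips and 2) stations, and I need to find the station with the highest average trip distance in them. I have joined these two tables: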

bigquery-public-data.london_bicycles.cycle_hire
bigquery-public-data.london_bicycles.cycle_stations

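Every trip has a start station and an end station. A trip usually ends at a different station from the one where it started, so the distance differs from trip to trip. I want to find out which start-to-end station trips are longest on average, meaning the end station that is, on average, furthest away for cyclists.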

I would like 1) the output to look like this:

Trip  Start_station_coordinate  start_st_name  end_station_coordinate  end_st_name   km_dist
 1    POINT(-0.123 51.123)      A-station      POINT(-0.123 51.123)     B-station      ??
 2    POINT(-0.123 51.123)      C-station      POINT(-0.123 51.123)     D-station      ??
 3    POINT(-0.123 51.123)      D-station      POINT(-0.123 51.123)     F-station      ??

...and 2) grouped by start_station, showing which start_station has the highest average km_distance per trip. Like this:

start_station   average_distance_descending

  A-station     20 km      
  B-station     15 km 
  C-station     3  km

My code is a JOIN, and I cannot work the above into my query (I am new to SQL). I tried the following; the problem is on the last line:

 `SELECT ST_GeogPoint(stations1.longitude, stations1.latitude) as WKT1
   ,stations1.id
   ,ST_GeogPoint(stations2.longitude, stations2.latitude) as WKT2
   ,stations2.id as id_2  
   ,trips.end_station_id
   ,trips.start_station_id
   from bigquery-public-data.london_bicycles.cycle_hire as trips
   Inner JOIN bigquery-public-data.london_bicycles.cycle_stations as stations1
   ON trips.start_station_id = stations1.id 
   Inner JOIN bigquery-public-data.london_bicycles.cycle_stations as stations2
   ON trips.end_station_id = stations2.id
   order by AVG(st_distance(WKT1, WKT2))`

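BigQuery says: "ORDER BY clause only allows aggregation if GROUP BY or SELECT list aggregation is present at [22:5]", which refers to the last line. I have been racking my brain over how to find the greatest average distance (if that is possible at all) and how to combine it with the JOIN.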

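How do I calculate the distance correctly and write the query the right way? This is a very important task and my deadline is close to hopeless, so I would appreciate help as soon as possible.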

Answer 1

Below is for BigQuery Standard SQL

#standardSQL
WITH output_1 AS (
  SELECT 
    ST_GEOGPOINT(stations1.longitude, stations1.latitude) AS WKT1,
    stations1.name AS start_st_name,
    ST_GEOGPOINT(stations2.longitude, stations2.latitude) AS WKT2,
    stations2.name AS end_st_name,
    ST_DISTANCE(ST_GEOGPOINT(stations1.longitude, stations1.latitude), ST_GEOGPOINT(stations2.longitude, stations2.latitude)) AS dist
  FROM bigquery-public-data.london_bicycles.cycle_hire AS trips
  INNER JOIN bigquery-public-data.london_bicycles.cycle_stations AS stations1
    ON trips.start_station_id = stations1.id 
  INNER JOIN bigquery-public-data.london_bicycles.cycle_stations AS stations2
    ON trips.end_station_id = stations2.id
), output_2 AS (
  SELECT 
    start_st_name AS start_station, 
    ROUND(AVG(dist), 2) AS average_distance
  FROM output_1
  GROUP BY start_st_name
)
SELECT *
FROM output_2
ORDER BY average_distance DESC
LIMIT 10

with output (average_distance is in meters, the unit ST_DISTANCE returns):

Row start_station                               average_distance     
1   Blackfriars Station, St. Paul's             5895.44  
2   Bonner Gate, Victoria Park                  4105.8   
3   Walworth Square, Walworth                   3751.54  
4   Bourne Street, Belgravia                    3681.56  
5   Clarence Walk, Stockwell                    3351.18  
6   Clapham Road, Lingham Street, Stockwell     3293.93  
7   Clapham Common North Side, Clapham Common   3268.38  
8   Limburg Road, Clapham Junction              3156.89  
9   Wandsworth Rd, Isley Court, Wandsworth Road 3148.16  
10  Sugden Road, Clapham                        3107.68
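
For part 1 of the question (one row per trip, with the distance in km), a minimal sketch along the same lines might look like the query below. It assumes rental_id is the trip identifier column in cycle_hire, and it divides by 1000 because ST_DISTANCE returns meters. The column aliases are only chosen to mirror the layout in the question.

```sql
#standardSQL
-- Sketch only: per-trip rows in the shape requested in part 1 of the question.
-- Assumption: rental_id identifies a trip in cycle_hire.
SELECT
  trips.rental_id AS trip,
  ST_GEOGPOINT(stations1.longitude, stations1.latitude) AS start_station_coordinate,
  stations1.name AS start_st_name,
  ST_GEOGPOINT(stations2.longitude, stations2.latitude) AS end_station_coordinate,
  stations2.name AS end_st_name,
  -- ST_DISTANCE returns meters for GEOGRAPHY inputs, so divide by 1000 for km.
  ROUND(ST_DISTANCE(
    ST_GEOGPOINT(stations1.longitude, stations1.latitude),
    ST_GEOGPOINT(stations2.longitude, stations2.latitude)) / 1000, 2) AS km_dist
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS trips
INNER JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS stations1
  ON trips.start_station_id = stations1.id
INNER JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS stations2
  ON trips.end_station_id = stations2.id
LIMIT 10
```

Note that this is the straight-line (geodesic) distance between the two stations, not the route actually ridden.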

Answer 2

I don't think you want the "average distance" between stations. Any two stations are always the same distance apart.

Let's first create a table with a JOIN of all possible station combinations:

CREATE TABLE temp_eu.stations AS (
   SELECT station1, station2
     , ST_DISTANCE(
         ST_GeogPoint(station1.longitude, station1.latitude)
         , ST_GeogPoint(station2.longitude, station2.latitude)) distance
   FROM `bigquery-public-data.london_bicycles.cycle_stations` station1
   -- CROSS JOIN yields every (station1, station2) pair,
   -- including each station paired with itself (distance 0)
   CROSS JOIN `bigquery-public-data.london_bicycles.cycle_stations` station2
); 
# 1.4 sec elapsed, 76.1 KB processed
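
One way to sanity-check the pair table before joining against it is sketched below. If the table really holds every combination, pair_count should be the square of the number of stations and self_pairs should equal the number of stations. If the two counts are equal, the table only contains stations paired with themselves. The output column names here are illustrative only.

```sql
#standardSQL
-- Sketch only: basic shape check of the station-pair table.
SELECT
  COUNT(*) AS pair_count,                              -- all (station1, station2) rows
  COUNTIF(station1.id = station2.id) AS self_pairs,    -- rows pairing a station with itself
  ROUND(MAX(distance) / 1000, 2) AS max_km             -- largest pairwise distance, in km
FROM temp_eu.stations
```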

Now you can enrich the original table with this data and, if needed, order by distance:

SELECT
 distance, station1, station2
 ,hire.duration
 ,hire.bike_id
 ,hire.end_date
 ,hire.end_station_id
 ,hire.end_station_name
 ,hire.start_date
 ,hire.start_station_id
 ,hire.start_station_name
 from `bigquery-public-data.london_bicycles.cycle_hire` as hire
JOIN temp_eu.stations
ON hire.start_station_id = station1.id 
AND hire.end_station_id = station2.id
ORDER BY distance
LIMIT 100
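
If the goal is the average trip distance per start station, as in part 2 of the question, the same join can be aggregated. The following is a sketch that assumes temp_eu.stations holds one row per (station1, station2) pair with distance in meters. The average_km and trips aliases are illustrative.

```sql
#standardSQL
-- Sketch only: average straight-line trip distance per start station.
SELECT
  hire.start_station_name AS start_station,
  ROUND(AVG(distance) / 1000, 2) AS average_km,  -- meters to km
  COUNT(*) AS trips                              -- trips behind each average
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS hire
JOIN temp_eu.stations
  ON hire.start_station_id = station1.id
  AND hire.end_station_id = station2.id
GROUP BY hire.start_station_name
ORDER BY average_km DESC
LIMIT 10
```

Keeping the trip count next to the average makes it easier to spot stations whose high average rests on only a handful of trips.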
