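I have two public datasets, 1) bicycle trips and 2) stations, and I need to find the stations with the highest average trip distance. I have joined these two tables: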
bigquery-public-data.london_bicycles.cycle_hire
bigquery-public-data.london_bicycles.cycle_stations
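Each trip has a start station and an end station. A trip usually ends at a different station from the one it started at, so the distance varies from trip to trip. I want to find which start station has the longest average trip distance to its end stations, that is, the station from which cyclists travel farthest on average.
I would like 1) the output to look like this: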
Trip Start_station_coordinate start_st_name end_station_coordinate end_st_name km_dist
1 POINT(-0.123 51.123) A-station POINT(-0.123 51.123) B-station ??
2 POINT(-0.123 51.123) C-station POINT(-0.123 51.123) D-station ??
3 POINT(-0.123 51.123) D-station POINT(-0.123 51.123) F-station ??
...and 2) a result grouped by start_station, showing the start stations with the highest average km_distance per trip, like this:
start_station average_distance_descending
A-station 20 km
B-station 15 km
C-station 3 km
My code is a JOIN, and I can't work the above into my query (I'm new to SQL). I tried the following, with the problem in the last line:
`SELECT ST_GeogPoint(stations1.longitude, stations1.latitude) as WKT1
,stations1.id
,ST_GeogPoint(stations2.longitude, stations2.latitude) as WKT2
,stations2.id as id_2
,trips.end_station_id
,trips.start_station_id
from bigquery-public-data.london_bicycles.cycle_hire as trips
Inner JOIN bigquery-public-data.london_bicycles.cycle_stations as stations1
ON trips.start_station_id = stations1.id
Inner JOIN bigquery-public-data.london_bicycles.cycle_stations as stations2
ON trips.end_station_id = stations2.id
order by AVG(st_distance(WKT1, WKT2))`
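BigQuery says: "The ORDER BY clause only allows aggregation if GROUP BY or SELECT list aggregation is present at [22:5]", referring to the last line. I have been racking my brain over how to find the highest average distance (if that is possible) and how to combine it with the JOIN.
How do I compute the distance correctly and write the query the right way? This is a very important task and I am hopelessly close to my deadline, so I would appreciate help as soon as possible.
Answer 0: (score: 1)
Below is for BigQuery Standard SQL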
#standardSQL
WITH output_1 AS (
SELECT
ST_GEOGPOINT(stations1.longitude, stations1.latitude) AS WKT1,
stations1.name AS start_st_name,
ST_GEOGPOINT(stations2.longitude, stations2.latitude) AS WKT2,
stations2.name AS end_st_name,
ST_DISTANCE(ST_GEOGPOINT(stations1.longitude, stations1.latitude), ST_GEOGPOINT(stations2.longitude, stations2.latitude)) AS dist
FROM bigquery-public-data.london_bicycles.cycle_hire AS trips
INNER JOIN bigquery-public-data.london_bicycles.cycle_stations AS stations1
ON trips.start_station_id = stations1.id
INNER JOIN bigquery-public-data.london_bicycles.cycle_stations AS stations2
ON trips.end_station_id = stations2.id
), output_2 AS (
SELECT
start_st_name AS start_station,
ROUND(AVG(dist), 2) AS average_distance
FROM output_1
GROUP BY start_st_name
)
SELECT *
FROM output_2
ORDER BY average_distance DESC
LIMIT 10
with output
Row start_station average_distance
1 Blackfriars Station, St. Paul's 5895.44
2 Bonner Gate, Victoria Park 4105.8
3 Walworth Square, Walworth 3751.54
4 Bourne Street, Belgravia 3681.56
5 Clarence Walk, Stockwell 3351.18
6 Clapham Road, Lingham Street, Stockwell 3293.93
7 Clapham Common North Side, Clapham Common 3268.38
8 Limburg Road, Clapham Junction 3156.89
9 Wandsworth Rd, Isley Court, Wandsworth Road 3148.16
10 Sugden Road, Clapham 3107.68
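The averages above are in meters, because ST_DISTANCE returns meters for geography inputs, and they measure the straight-line distance between the two stations rather than the route actually ridden. A minimal sketch of the same aggregation reported in kilometers, as the question asked for; the aliases s1, s2 and average_km are my own choices and not part of the answer above:

```sql
#standardSQL
-- Sketch: average straight-line trip distance per start station, in km.
SELECT
  s1.name AS start_station,
  -- ST_DISTANCE returns meters; divide by 1000 to get kilometers.
  ROUND(AVG(ST_DISTANCE(
    ST_GEOGPOINT(s1.longitude, s1.latitude),
    ST_GEOGPOINT(s2.longitude, s2.latitude))) / 1000, 2) AS average_km
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS trips
INNER JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s1
  ON trips.start_station_id = s1.id
INNER JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s2
  ON trips.end_station_id = s2.id
GROUP BY start_station
ORDER BY average_km DESC
LIMIT 10
```

Because the aggregate is computed in the SELECT list together with a GROUP BY, ordering by it no longer triggers the error from the question.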
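Answer 1: (score: 0)
I don't think you want the "average distance" between stations. The distance between 2 stations is always the same.
Let's first create a table with a JOIN of all possible station combinations: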
CREATE TABLE temp_eu.stations AS (
SELECT station1, station2
, ST_DISTANCE(
ST_GeogPoint(station1.longitude, station1.latitude)
, ST_GeogPoint(station2.longitude, station2.latitude)) distance
FROM `bigquery-public-data.london_bicycles.cycle_stations` station1
-- pair every station with every station (a station paired with itself has distance 0)
CROSS JOIN `bigquery-public-data.london_bicycles.cycle_stations` station2
);
# 1.4 sec elapsed, 76.1 KB processed
Now you can enrich the original table with this data, ordering by distance if needed:
SELECT
distance, station1, station2
,hire.duration
,hire.bike_id
,hire.end_date
,hire.end_station_id
,hire.end_station_name
,hire.start_date
,hire.start_station_id
,hire.start_station_name
from `bigquery-public-data.london_bicycles.cycle_hire` as hire
JOIN temp_eu.stations
ON hire.start_station_id = station1.id
AND hire.end_station_id = station2.id
ORDER BY distance
LIMIT 100
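If the per-station average from the question is still wanted, the same lookup table can feed it. A rough sketch, assuming temp_eu.stations holds one row per (start, end) station pair with station1 and station2 stored as STRUCT columns, as the description above intends; the alias average_km is hypothetical:

```sql
#standardSQL
-- Sketch: average trip distance per start station, using the precomputed pair distances.
SELECT
  station1.name AS start_station,
  -- distance was stored in meters by ST_DISTANCE; convert to kilometers.
  ROUND(AVG(distance) / 1000, 2) AS average_km
FROM `bigquery-public-data.london_bicycles.cycle_hire` AS hire
JOIN temp_eu.stations
  ON hire.start_station_id = station1.id
  AND hire.end_station_id = station2.id
GROUP BY start_station
ORDER BY average_km DESC
LIMIT 10
```

The distance between a given pair of stations is fixed, but the average differs per start station because each station has its own mix of destinations.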