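I'm trying to work with a GTFS database, namely the one RATP publishes for Paris and its suburbs.
The dataset is huge: the stop_times table has 14 million rows.
This is the table schema: https://github.com/mauryquijada/gtfs-mysql/blob/master/gtfs-sql.sql
I'm looking for the most efficient way to get the routes available at a given location. As far as I understand the GTFS specification, these are the tables and the links that lead from my data (lat/lon) to the routes: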
stops   | stop_times | trips    | routes
--------+------------+----------+---------
lat     | stop_id    | trip_id  | route_id
lon     | trip_id    | route_id |
stop_id |            |          |
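For clarity, I put what I want into a three-step script (in effect, the three links between the four tables above), published as this gist: https://gist.github.com/BenoitDuffez/4eba85e3598ebe6ece5f
Here is how I built that script.
I can quickly find all the stops within walking distance (say, 200 meters) in under a second. I use: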
$ . mysql.ini && time mysql -h $host -N -B -u $user -p${pass} $name -e "SELECT stop_id, (6371000*acos(cos(radians(48.824699))*cos(radians(s.stop_lat))*cos(radians(2.3243)-radians(s.stop_lon))+sin(radians(48.824699))*sin(radians(s.stop_lat)))) AS distance
FROM stops s
GROUP BY s.stop_id
HAVING distance < 200
ORDER BY distance ASC" | awk '{print $1}'
3705271
4472979
4036891
4036566
3908953
3908755
3900765
3900693
3900607
4473141
3705272
4472978
4036892
4036472
4035057
3908952
3705288
3908814
3900832
3900672
3900752
3781623
3781622
real 0m0.797s
user 0m0.000s
sys 0m0.000s
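The distance expression in the query above is the spherical law of cosines on a sphere of radius 6,371,000 m. As a minimal sketch for checking that formula outside MySQL (the function name `distance_m` is hypothetical, and the clamp is an addition that guards against floating-point rounding):

```python
import math

EARTH_RADIUS_M = 6371000  # same constant as in the SQL expression

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters via the spherical law of cosines."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon1) - math.radians(lon2)
    cos_angle = (math.cos(phi1) * math.cos(phi2) * math.cos(dlon)
                 + math.sin(phi1) * math.sin(phi2))
    # Rounding can push the value slightly outside [-1, 1], where acos fails.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_M * math.acos(cos_angle)

# Two points one degree of latitude apart, on the same meridian.
d = distance_m(48.824699, 2.3243, 49.824699, 2.3243)
```

On the same meridian the result should equal one degree of arc on that sphere, i.e. `EARTH_RADIUS_M * pi / 180`, which is a quick sanity check for the expression.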
Then, getting all the stop_times later today (using stop_times.departure_time > '`date +%T`') takes a very long time:
"SELECT trip_id
FROM stop_times
WHERE
stop_id IN ($stops) AND departure_time >= '$now'
GROUP BY trip_id"
$stops contains the list of stops obtained in the first step. Here is an example:
$ . mysql.ini && time mysql -h $host -N -B -u $user -p${pass} $name -e "SELECT stop_id, (6371000*acos(cos(radians(
FROM stops s
GROUP BY s.stop_id
HAVING distance < 200
ORDER BY distance ASC" | awk '{print $1}'
3705271
4472979
4036891
4036566
3908953
...
9916360850964321
9916360920964320
9916360920964321
real 1m21.399s
user 0m0.000s
sys 0m0.000s
This result has more than 2000 rows.
My last step is to select all the routes matching these trip_id values. This is simple, and fast:
$ . mysql.ini && time mysql -h $host -u $user -p${pass} $name -e "SELECT r.id, r.route_long_name FROM trips t, routes r WHERE t.trip_id IN (`cat trip_ids | tr '\n' '#' | sed -e 's/##$//' -e 's/#/,/g'`) AND r.route_id = t.route_id GROUP BY t.route_id"
+------+-------------------------------------------------------------------------+
| id | route_long_name |
+------+-------------------------------------------------------------------------+
| 290 | (PLACE DE CLICHY <-> CHATILLON METRO) - Aller |
| 291 | (PLACE DE CLICHY <-> CHATILLON METRO) - Retour |
| 404 | (PORTE D'ORLEANS-METRO <-> ECOLE VETERINAIRE DE MAISON-ALFORT) - Aller |
| 405 | (PORTE D'ORLEANS-METRO <-> ECOLE VETERINAIRE DE MAISON-ALFORT) - Retour |
| 453 | (PORTE D'ORLEANS-METRO <-> LYCEE POLYVALENT) - Retour |
| 457 | (PORTE D'ORLEANS-METRO <-> LYCEE POLYVALENT) - Retour |
| 479 | (PORTE D'ORLEANS-METRO <-> VELIZY 2) - Retour |
| 810 | (PLACE DE LA LIBERATION <-> GARE MONTPARNASSE) - Aller |
| 989 | (PORTE D'ORLEANS-METRO) - Retour |
| 1034 | (PLACE DE LA LIBERATION <-> HOTEL DE VILLE DE PARIS_4E__AR) - Aller |
+------+-------------------------------------------------------------------------+
real 0m1.070s
user 0m0.000s
sys 0m0.000s
Here the trip_ids file contains the 2k trip IDs.
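For comparison, here is a hedged sketch of the same "feed a list of IDs into the next query" step done from a script with bound parameters rather than by splicing text with `tr` and `sed`. It uses Python's built-in `sqlite3` module and made-up rows as a stand-in for the real MySQL database; the helper name `routes_for_trips` and the two-table layout are assumptions for illustration only.

```python
import sqlite3

def routes_for_trips(conn, trip_ids):
    """Return (route_id, route_long_name) rows for the given trip IDs."""
    if not trip_ids:
        return []
    # One "?" placeholder per ID; the values themselves are bound, not spliced.
    placeholders = ",".join("?" for _ in trip_ids)
    sql = (
        "SELECT DISTINCT r.route_id, r.route_long_name "
        "FROM trips t JOIN routes r ON r.route_id = t.route_id "
        f"WHERE t.trip_id IN ({placeholders}) "
        "ORDER BY r.route_id"
    )
    return conn.execute(sql, list(trip_ids)).fetchall()

# Tiny in-memory stand-in for the real database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE routes (route_id TEXT PRIMARY KEY, route_long_name TEXT);
    CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT);
    INSERT INTO routes VALUES ('R1', 'Route one'), ('R2', 'Route two');
    INSERT INTO trips VALUES ('T1', 'R1'), ('T2', 'R1'), ('T3', 'R2');
""")
rows = routes_for_trips(conn, ["T1", "T2"])
```

With thousands of IDs the placeholder list gets long, and some database drivers cap the number of bound parameters, so this only tidies the approach; it does not make it faster.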
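How can I get this result faster? Is there a better way to walk through the data than the stops > stop_times > trips > routes path I took?
In practice the total time for one "query" is about 30 seconds: "Which routes are within 200 meters of this location?". That is far too much...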
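Answer 0 (score: 3)
The short answer is: use table joins and indexes.
Here is the longer answer:
You have the right idea, and your understanding of how the tables relate to each other is correct. However, by asking the DBMS to match field values against a list (with WHERE ... IN) instead of joining the tables together, you are making it do more work than necessary.
What you really want is to do all of this in a single query, using JOIN clauses to link the tables together. Try this, which also joins the calendars and calendar_dates tables to restrict the results to routes that actually run today: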
SELECT DISTINCT r.id, r.route_long_name
FROM (SELECT s.stop_id, (6371000 *
acos(cos(radians(48.824699)) * cos(radians(s.stop_lat)) *
cos(radians(2.3243) - radians(s.stop_lon)) +
sin(radians(48.824699)) * sin(radians(s.stop_lat)))) AS distance
FROM stops AS s) AS i_s
INNER JOIN stop_times AS st ON st.stop_id = i_s.stop_id
INNER JOIN (SELECT trip_id, route_id FROM trips AS t
INNER JOIN (SELECT service_id FROM calendars
WHERE start_date <= '2014-09-09'
AND end_date >= '2014-09-09'
AND tuesday = 1
UNION
SELECT service_id FROM calendar_dates
WHERE date = '2014-09-09'
AND exception_type = 1
EXCEPT
SELECT service_id FROM calendar_dates
WHERE date = '2014-09-09'
AND exception_type = 2) AS c
ON c.service_id = t.service_id) AS t_r
ON t_r.trip_id = st.trip_id
INNER JOIN routes AS r ON r.route_id = t_r.route_id
WHERE st.departure_time > '$now'
AND i_s.distance < 200;
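The innermost subquery above selects the services running on a given day: those whose regular calendar covers the date and weekday, plus services added by an exception (exception_type = 1), minus services removed by one (exception_type = 2). As a rough sketch of that set logic in Python (the function `active_services`, the helper `weekly`, and the sample rows are all hypothetical, and dates are assumed to be stored as 'YYYY-MM-DD' strings as in the query):

```python
from datetime import date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def active_services(calendars, calendar_dates, day):
    """Return the set of service IDs running on `day` (a datetime.date)."""
    iso = day.isoformat()
    weekday = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
    regular = {c["service_id"] for c in calendars
               if c["start_date"] <= iso <= c["end_date"] and c[weekday] == 1}
    todays = [e for e in calendar_dates if e["date"] == iso]
    added = {e["service_id"] for e in todays if e["exception_type"] == 1}
    removed = {e["service_id"] for e in todays if e["exception_type"] == 2}
    # Same shape as the SQL: (regular UNION added) EXCEPT removed
    return (regular | added) - removed

def weekly(service_id, *days):
    """Build a made-up calendar row valid for all of 2014 on the given weekdays."""
    row = {"service_id": service_id,
           "start_date": "2014-01-01", "end_date": "2014-12-31"}
    row.update({name: int(name in days) for name in WEEKDAYS})
    return row

calendars = [weekly("A", "tuesday"), weekly("B", "sunday"), weekly("C", "tuesday")]
calendar_dates = [
    {"service_id": "C", "date": "2014-09-09", "exception_type": 2},  # removed today
    {"service_id": "D", "date": "2014-09-09", "exception_type": 1},  # added today
    {"service_id": "A", "date": "2014-09-10", "exception_type": 2},  # another day
]
services = active_services(calendars, calendar_dates, date(2014, 9, 9))
```

One caveat on the SQL itself: as far as I know, MySQL only accepts EXCEPT from version 8.0.31 onward, so on an older server the removal step would have to be written differently, for example with NOT IN or an anti-join.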
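Here INNER JOIN is used to "add" the columns of another table, keeping only the rows that match the condition in the ON clause. This should be faster than generating a list of results with one query and then feeding it to the next.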
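To make the join chain concrete, here is a small self-contained sketch that runs the same stops > stop_times > trips > routes join against an in-memory SQLite database. Everything in it is a stand-in: the rows are made up, the calendar filtering is left out for brevity, and `distance_m` is a hypothetical Python function registered with SQLite because its trigonometric functions are not always compiled in.

```python
import math
import sqlite3

def distance_m(lat1, lon1, lat2, lon2):
    """Spherical law of cosines, the same expression as in the SQL query."""
    c = (math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.cos(math.radians(lon1) - math.radians(lon2))
         + math.sin(math.radians(lat1)) * math.sin(math.radians(lat2)))
    return 6371000 * math.acos(max(-1.0, min(1.0, c)))

conn = sqlite3.connect(":memory:")
conn.create_function("distance_m", 4, distance_m)
conn.executescript("""
    CREATE TABLE stops (stop_id TEXT PRIMARY KEY, stop_lat REAL, stop_lon REAL);
    CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, departure_time TEXT);
    CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT);
    CREATE TABLE routes (route_id TEXT PRIMARY KEY, route_long_name TEXT);
    INSERT INTO stops VALUES ('near', 48.8250, 2.3245), ('far', 48.9000, 2.4000);
    INSERT INTO routes VALUES
        ('R1', 'Near route'), ('R2', 'Far route'), ('R3', 'Early route');
    INSERT INTO trips VALUES ('T1', 'R1'), ('T2', 'R2'), ('T3', 'R3');
    INSERT INTO stop_times VALUES
        ('T1', 'near', '15:30:00'),  -- close by, still to come
        ('T2', 'far',  '15:30:00'),  -- too far away
        ('T3', 'near', '08:00:00');  -- close by, already departed
""")

QUERY = """
    SELECT DISTINCT r.route_id, r.route_long_name
    FROM stops AS s
    INNER JOIN stop_times AS st ON st.stop_id = s.stop_id
    INNER JOIN trips AS t ON t.trip_id = st.trip_id
    INNER JOIN routes AS r ON r.route_id = t.route_id
    WHERE distance_m(?, ?, s.stop_lat, s.stop_lon) < ?
      AND st.departure_time > ?
"""
rows = conn.execute(QUERY, (48.824699, 2.3243, 200, "15:00:00")).fetchall()
```

Only the route whose trip serves the nearby stop after 15:00:00 survives all three joins and both filters, with no intermediate list of IDs passed between queries.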
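For even better performance you need to create some indexes, so the DBMS does not have to scan the tables linearly. The rule of thumb is to define an index for every column used in a JOIN or WHERE clause. Here are the indexes I defined; with them you should find the above query performs quite well: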
CREATE INDEX calendar_dates_date_exception_type_service_id_index
ON calendar_dates (date, exception_type, service_id);
CREATE INDEX trips_service_id_trip_id_route_id_index
ON trips (service_id, trip_id, route_id);
CREATE INDEX stop_times_trip_id_departure_time_stop_id_index
ON stop_times (trip_id, departure_time, stop_id);
CREATE INDEX routes_route_id_index ON routes (route_id);
CREATE INDEX stops_stop_id_index ON stops (stop_id);
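A quick way to see whether an index is actually picked up is to look at the query plan before and after creating it (MySQL has EXPLAIN for this). The sketch below does the equivalent with SQLite's EXPLAIN QUERY PLAN on an empty stand-in table. The index name and its column order (stop_id first) are assumptions chosen for this lookup and differ from the indexes listed above; the right order depends on which columns the query filters by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, departure_time TEXT)"
)

def plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

SQL = "SELECT trip_id FROM stop_times WHERE stop_id = ? AND departure_time > ?"
ARGS = ("near", "15:00:00")

# Without an index the only option is a full scan of stop_times.
before = plan(conn, SQL, ARGS)

# Hypothetical index: equality column first, then the range column,
# then trip_id so the query can be answered from the index alone.
conn.execute(
    "CREATE INDEX stop_times_stop_dep_trip "
    "ON stop_times (stop_id, departure_time, trip_id)"
)
after = plan(conn, SQL, ARGS)

print(before)
print(after)
```

The first plan should report a scan of the table and the second a search that names the new index. On a 14-million-row table that difference is what separates seconds from minutes.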
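Answer 1 (score: 1)
The table schema I was using was completely wrong; I should have built it myself, or at least analyzed it before using it.
Here is an updated schema: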
CREATE TABLE `agency` (
transit_system VARCHAR(50) NOT NULL,
agency_id VARCHAR(100),
agency_name VARCHAR(255) NOT NULL,
agency_url VARCHAR(255) NOT NULL,
agency_timezone VARCHAR(100) NOT NULL,
agency_lang VARCHAR(100),
agency_phone VARCHAR(100),
agency_fare_url VARCHAR(100),
PRIMARY KEY (agency_id)
);
CREATE TABLE `calendar_dates` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
service_id VARCHAR(255) NOT NULL,
`date` VARCHAR(8) NOT NULL,
exception_type TINYINT(2) NOT NULL,
KEY `service_id` (service_id),
KEY `exception_type` (exception_type)
);
CREATE TABLE `calendar` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
service_id VARCHAR(255) NOT NULL,
monday TINYINT(1) NOT NULL,
tuesday TINYINT(1) NOT NULL,
wednesday TINYINT(1) NOT NULL,
thursday TINYINT(1) NOT NULL,
friday TINYINT(1) NOT NULL,
saturday TINYINT(1) NOT NULL,
sunday TINYINT(1) NOT NULL,
start_date VARCHAR(8) NOT NULL,
end_date VARCHAR(8) NOT NULL,
KEY `service_id` (service_id)
);
CREATE TABLE `fare_attributes` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
fare_id VARCHAR(100),
price VARCHAR(50) NOT NULL,
currency_type VARCHAR(50) NOT NULL,
payment_method TINYINT(1) NOT NULL,
transfers TINYINT(1) NOT NULL,
transfer_duration VARCHAR(10),
exception_type TINYINT(2) NOT NULL,
agency_id INT(100),
KEY `fare_id` (fare_id)
);
CREATE TABLE `fare_rules` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
fare_id VARCHAR(100),
route_id VARCHAR(100),
origin_id VARCHAR(100),
destination_id VARCHAR(100),
contains_id VARCHAR(100),
KEY `fare_id` (fare_id),
KEY `route_id` (route_id)
);
CREATE TABLE `feed_info` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
feed_publisher_name VARCHAR(100),
feed_publisher_url VARCHAR(255) NOT NULL,
feed_lang VARCHAR(255) NOT NULL,
feed_start_date VARCHAR(8),
feed_end_date VARCHAR(8),
feed_version VARCHAR(100)
);
CREATE TABLE `frequencies` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
trip_id VARCHAR(100) NOT NULL,
start_time VARCHAR(8) NOT NULL,
end_time VARCHAR(8) NOT NULL,
headway_secs VARCHAR(100) NOT NULL,
exact_times TINYINT(1),
KEY `trip_id` (trip_id)
);
CREATE TABLE `routes` (
transit_system VARCHAR(50) NOT NULL,
route_id VARCHAR(100),
agency_id VARCHAR(50),
route_short_name VARCHAR(50) NOT NULL,
route_long_name VARCHAR(255) NOT NULL,
route_type VARCHAR(2) NOT NULL,
route_text_color VARCHAR(255),
route_color VARCHAR(255),
route_url VARCHAR(255),
route_desc VARCHAR(255),
PRIMARY KEY (route_id),
KEY `agency_id` (agency_id),
KEY `route_type` (route_type),
CONSTRAINT `agency_id` FOREIGN KEY (`agency_id`) REFERENCES `agency` (`agency_id`)
);
CREATE TABLE `shapes` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
shape_id VARCHAR(100) NOT NULL,
shape_pt_lat DECIMAL(8,6) NOT NULL,
shape_pt_lon DECIMAL(8,6) NOT NULL,
shape_pt_sequence TINYINT(3) NOT NULL,
shape_dist_traveled VARCHAR(50),
KEY `shape_id` (shape_id)
);
CREATE TABLE `stops` (
transit_system VARCHAR(50) NOT NULL,
stop_id VARCHAR(255),
stop_code VARCHAR(50),
stop_name VARCHAR(255) NOT NULL,
stop_desc VARCHAR(255),
stop_lat DECIMAL(10,6) NOT NULL,
stop_lon DECIMAL(10,6) NOT NULL,
zone_id VARCHAR(255),
stop_url VARCHAR(255),
location_type VARCHAR(2),
parent_station VARCHAR(100),
stop_timezone VARCHAR(50),
wheelchair_boarding TINYINT(1),
PRIMARY KEY (stop_id),
KEY `zone_id` (zone_id),
KEY `stop_lat` (stop_lat),
KEY `stop_lon` (stop_lon),
KEY `location_type` (location_type),
KEY `parent_station` (parent_station)
);
CREATE TABLE `trips` (
transit_system VARCHAR(50) NOT NULL,
route_id VARCHAR(100) NOT NULL,
service_id VARCHAR(100) NOT NULL,
trip_id VARCHAR(255),
trip_headsign VARCHAR(255),
trip_short_name VARCHAR(255),
direction_id TINYINT(1), #0 for one direction, 1 for another.
block_id VARCHAR(11),
shape_id VARCHAR(11),
wheelchair_accessible TINYINT(1), #0 for no information, 1 for at least one rider accommodated on wheel chair, 2 for no riders accommodated.
bikes_allowed TINYINT(1), #0 for no information, 1 for at least one bicycle accommodated, 2 for no bicycles accommodated
PRIMARY KEY (trip_id),
KEY `route_id` (route_id),
KEY `service_id` (service_id),
KEY `direction_id` (direction_id),
KEY `block_id` (block_id),
KEY `shape_id` (shape_id),
CONSTRAINT `route_id` FOREIGN KEY (`route_id`) REFERENCES `routes` (`route_id`),
CONSTRAINT `service_id` FOREIGN KEY (`service_id`) REFERENCES `calendar` (`service_id`)
);
CREATE TABLE `stop_times` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
trip_id VARCHAR(100) NOT NULL,
arrival_time VARCHAR(8) NOT NULL,
arrival_time_seconds INT(100),
departure_time VARCHAR(8) NOT NULL,
departure_time_seconds INT(100),
stop_id VARCHAR(100) NOT NULL,
stop_sequence VARCHAR(100) NOT NULL,
stop_headsign VARCHAR(50),
pickup_type VARCHAR(2),
drop_off_type VARCHAR(2),
shape_dist_traveled VARCHAR(50),
KEY `trip_id` (trip_id),
KEY `arrival_time_seconds` (arrival_time_seconds),
KEY `departure_time_seconds` (departure_time_seconds),
KEY `stop_id` (stop_id),
KEY `stop_sequence` (stop_sequence),
KEY `pickup_type` (pickup_type),
KEY `drop_off_type` (drop_off_type),
CONSTRAINT `trip_id` FOREIGN KEY (`trip_id`) REFERENCES `trips` (`trip_id`),
CONSTRAINT `stop_id` FOREIGN KEY (`stop_id`) REFERENCES `stops` (`stop_id`)
);
CREATE TABLE `transfers` (
id INT(12) NOT NULL PRIMARY KEY AUTO_INCREMENT,
transit_system VARCHAR(50) NOT NULL,
from_stop_id INT(100) NOT NULL,
to_stop_id VARCHAR(8) NOT NULL,
transfer_type TINYINT(1) NOT NULL,
min_transfer_time VARCHAR(100)
);
I made each xyz_id key a PRIMARY KEY in its own table and a FOREIGN KEY in the other tables.
I still need to optimize this schema a bit.
This query now runs in under 1 to 5 seconds:
SELECT
s.stop_id,
(6371000*acos(cos(radians(48.1128135))*cos(radians(s.stop_lat))*cos(radians(-1.6470705)-radians(s.stop_lon))+sin(radians(48.1128135))*sin(radians(s.stop_lat)))) AS distance,
t.route_id,
st.*,
t.*,
r.*,
c.*
FROM stop_times st
LEFT JOIN stops s USING (stop_id)
LEFT JOIN trips t USING (trip_id)
LEFT JOIN routes r USING (route_id)
LEFT JOIN calendar c ON c.service_id = t.service_id
where
c.start_date <= '20140915'
and c.end_date >= '20140915'
and c.monday = 1 # 2014-09-15 is a Monday
and st.departure_time > '15:00:00'
HAVING
distance < 200
ORDER BY st.departure_time ASC
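The schema above keeps arrival_time_seconds and departure_time_seconds next to the textual times. A minimal sketch of how such a column could be filled during import (the helper name `gtfs_time_to_seconds` is hypothetical): GTFS times are 'HH:MM:SS' strings, the hour may have a single digit, and values past 24:00:00 are allowed for trips that run after midnight, so comparing them as integers is safer than comparing them as text.

```python
def gtfs_time_to_seconds(value):
    """Convert a GTFS 'HH:MM:SS' or 'H:MM:SS' string to seconds
    counted from the start of the service day."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"invalid GTFS time: {value!r}")
    return hours * 3600 + minutes * 60 + seconds

afternoon = gtfs_time_to_seconds("15:00:00")
after_midnight = gtfs_time_to_seconds("25:10:05")  # 01:10:05 on the next day
morning = gtfs_time_to_seconds("9:05:00")

# As text, '9:05:00' sorts after '15:00:00'; as seconds the order is right.
text_order_wrong = "9:05:00" > "15:00:00"
number_order_right = morning < afternoon
```

With the integer column populated, the time filter could be written as `st.departure_time_seconds > 54000` rather than as a string comparison, which also lets the `departure_time_seconds` key be used.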
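Answer 2 (score: 0)
All I can tell you is that I tried the same thing in SQL and it took forever, so I had to write a script instead: first in Perl (no gain), then in C++ (a 35x speedup).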