Question

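I am trying to select (with a WHERE clause) and sort a large database table in sqlite3 via Python. The sort currently takes over 30 minutes on about 36 MB of data. I suspect it could run much faster with indexes, but I think the order of my code may be wrong.

The code is executed in the order listed here.

My CREATE TABLE statement looks like this: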

c.execute('''CREATE table gtfs_stop_times (
  trip_id text , --REFERENCES gtfs_trips(trip_id),
  arrival_time text, -- CHECK (arrival_time LIKE '__:__:__'),
  departure_time text, -- CHECK (departure_time LIKE '__:__:__'),
  stop_id text , --REFERENCES gtfs_stops(stop_id),
  stop_sequence int NOT NULL --NOT NULL
)''')

Then, in the next step, I insert the rows:

stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)

Next, I create an index on two columns (trip_id and stop_sequence):

c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')

Finally, I run a SELECT statement with a WHERE clause that sorts the data by the two columns used in the index, and then write the result to a csv file:

c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
FROM gtfs_stop_times, gtfs_stops
WHERE gtfs_stop_times.stop_id=gtfs_stops.stop_code
ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence;
''')

f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer

Is there any way to speed up step 4 (possibly by changing how I add and/or use the index), or should I just go to lunch while this runs?

I added PRAGMA statements to try to improve performance, to no avail:

c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')

Answer 1

As posted, the SELECT executes extremely fast: there is no gtfs_stops table, so all you get is an error message.
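To illustrate the point, here is a minimal sketch using an in-memory database (an assumption; the question uses a file on disk) and a shortened version of the gtfs_stop_times table. With no gtfs_stops table defined, the join fails immediately instead of running for a long time:

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Shortened version of the table from the question; gtfs_stops is never created.
c.execute("""CREATE TABLE gtfs_stop_times (
  trip_id text,
  stop_id text,
  stop_sequence int NOT NULL
)""")

try:
    c.execute("""SELECT gtfs_stop_times.trip_id
                 FROM gtfs_stop_times, gtfs_stops
                 WHERE gtfs_stop_times.stop_id = gtfs_stops.stop_code""")
    error_message = None
except sqlite3.OperationalError as exc:
    # sqlite3 rejects the statement because gtfs_stops does not exist.
    error_message = str(exc)

print(error_message)
```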

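If we assume there is a gtfs_stops table, then your trip_seq index is already quite optimal for this query. However, you also need an index for looking up stop_code values in the gtfs_stops table.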
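A minimal sketch of that suggestion follows. The gtfs_stops schema (stop_id, stop_code), the index name stop_code_idx, and the sample rows are all assumptions made for illustration, since the question does not show that table. EXPLAIN QUERY PLAN is printed so you can check which indexes SQLite picks; its exact output varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for illustration only
c = conn.cursor()

c.execute("""CREATE TABLE gtfs_stop_times (
  trip_id text, arrival_time text, departure_time text,
  stop_id text, stop_sequence int NOT NULL)""")
# Assumed schema: the question never shows gtfs_stops.
c.execute("CREATE TABLE gtfs_stops (stop_id text, stop_code text)")

# Made-up sample rows, deliberately inserted out of order.
c.executemany("INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)", [
    ("T2", "08:10:00", "08:10:00", "C1", 1),
    ("T1", "08:05:00", "08:05:00", "C2", 2),
    ("T1", "08:00:00", "08:00:00", "C1", 1),
])
c.executemany("INSERT INTO gtfs_stops VALUES (?,?)",
              [("S1", "C1"), ("S2", "C2")])

# Index from the question, used for the ORDER BY.
c.execute("CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)")
# Additional index for the join lookup (hypothetical name).
c.execute("CREATE INDEX stop_code_idx ON gtfs_stops (stop_code)")

query = """SELECT st.trip_id, st.arrival_time, st.departure_time,
       s.stop_id, st.stop_sequence
FROM gtfs_stop_times AS st, gtfs_stops AS s
WHERE st.stop_id = s.stop_code
ORDER BY st.trip_id, st.stop_sequence"""

# Inspect the plan; wording differs between SQLite versions.
for row in c.execute("EXPLAIN QUERY PLAN " + query).fetchall():
    print(row)

rows = c.execute(query).fetchall()
index_names = {r[0] for r in c.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")}
```

With both indexes in place, SQLite has the option of walking gtfs_stop_times in trip_seq order and looking up each stop through the second index, rather than scanning gtfs_stops for every row. Whether it does so for your data is worth confirming with the printed plan.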

Sorting a large database table with multiple indexes in SQLite
