I am trying to put together an example of how to work with a huge database made up of CSV tables using only Python.
I would like to find a way to simulate an efficient indexed query on the tables.
The example below takes 24 seconds on a 3.2 GHz Core i5.
A more efficient or more Pythonic approach would be much appreciated.
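(The question's code block did not survive here; judging from the filter(lambda x: x[0]==i[2], players)[0] line quoted and timed as "your code" in the answer below, the approach being measured was roughly this Python 2 sketch:)

import csv

MAINDIR = "../"
pf = open(MAINDIR + "atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open(MAINDIR + "atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]

# For each of the first ten ranking rows, scan the whole players list for
# the matching id -- a linear search repeated for every row.
for i in rankings[:10]:
    player = filter(lambda x: x[0] == i[2], players)[0]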
Answer 0 (score: 5)
You can use itertools.islice instead of reading all the rows, and use itertools.ifilter:
import csv
from itertools import islice, ifilter

MAINDIR = "../"

with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list, gives values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
Not quite sure what filter(lambda x: x[0]==i[2], players)[0] is doing; you seem to be searching the whole players list every time and keeping only the first element. It might pay to sort the list using the first element as the key and binary search, or to build a dict with the first element as the key and the row as the value, and then simply do lookups.
import csv
from itertools import islice, ifilter
from collections import OrderedDict

MAINDIR = "../"

with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = OrderedDict((row[0], row) for row in csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        # now constant work getting a row as opposed to O(n)
        player = players.get(i[2])
What default value to use, or whether one is needed at all, you will have to decide for yourself.
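For completeness, the other option mentioned above, sorting on the first element and binary-searching, might look roughly like this sketch (an assumption on my part, using the bisect module; it compares the ids as plain strings):

import csv
from bisect import bisect_left
from itertools import islice

MAINDIR = "../"

with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    # sort once on the first column so it can be binary searched
    players = sorted(csv.reader(pf), key=lambda row: row[0])
    keys = [row[0] for row in players]
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        pos = bisect_left(keys, i[2])
        # bisect_left only gives a position; check it really holds the id we want
        player = players[pos] if pos < len(keys) and keys[pos] == i[2] else ""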
If you have duplicate elements at the start of some rows and only want to keep the first occurrence:
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
Output:
Djokovic(SRB),(R) Points: 11360
Federer(SUI),(R) Points: 9625
Nadal(ESP),(L) Points: 6585
Wawrinka(SUI),(R) Points: 5120
Nishikori(JPN),(R) Points: 5025
Murray(GBR),(R) Points: 4675
Berdych(CZE),(R) Points: 4600
Raonic(CAN),(R) Points: 4440
Cilic(CRO),(R) Points: 4150
Ferrer(ESP),(R) Points: 4045
Timing the code for ten players shows ifilter to be the fastest, but as we go through more of the rankings we see the dict win out, and how poorly your code scales:
In [33]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0]==i[2], players)[0]
   ....:
10 loops, best of 3: 123 ms per loop
In [34]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list, gives values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....:
10 loops, best of 3: 43.6 ms per loop
In [35]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
        pass
   ....:
10 loops, best of 3: 50.7 ms per loop
Now with 100 players you can see that the dict is just as fast as it was for 10; the cost of building the dict has been offset by its constant-time lookups:
In [38]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get first hundred rows using islice
    for i in islice(rankings, None, 100):
        # ifilter won't create a list, gives values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....:
10 loops, best of 3: 120 ms per loop
In [39]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[row[0]] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 100):
        player = players.get(i[2])
        pass
   ....:
10 loops, best of 3: 50.7 ms per loop
In [40]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:100]:
    player = filter(lambda x: x[0]==i[2], players)[0]
   ....:
1 loops, best of 3: 806 ms per loop
For 250 players:
# your code
1 loops, best of 3: 1.86 s per loop
# dict
10 loops, best of 3: 50.7 ms per loop
# ifilter
10 loops, best of 3: 483 ms per loop
A final test, looping over the whole of the rankings:
# your code
1 loops, best of 3: 2min 40s per loop
# dict
10 loops, best of 3: 67 ms per loop
# ifilter
1 loops, best of 3: 1min 3s per loop
So as you can see, as we loop over more of the rankings, the dict option is by far the most runtime-efficient and scales very well.
Answer 1 (score: 2)
Consider putting your data into an SQLite database. This meets your requirement of using only Python, since SQLite is built into the standard Python library and supported by (almost) every Python interpreter. SQLite is a database library that lets you work on your data using SQL syntax, and it gives you features such as indexes and foreign-key relationships.
If you need to run more than one query against the data, doing some work up front (i.e. indexing and normalizing the data) is the most sensible route.
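A minimal sketch of what that could look like (the table layout, generic column names, and the atp.db file name are my assumptions, not part of this answer; the CSV files are the ones from the question):

import csv
import sqlite3

MAINDIR = "../"

conn = sqlite3.connect("atp.db")
cur = conn.cursor()

# Load the players CSV into a table; the first column is treated as the
# player id, matching the x[0] lookup used elsewhere on this page.
with open(MAINDIR + "atp_players.csv") as pf:
    rows = list(csv.reader(pf))
ncols = len(rows[0])
cur.execute("CREATE TABLE players (%s)" % ", ".join("c%d TEXT" % n for n in range(ncols)))
cur.executemany("INSERT INTO players VALUES (%s)" % ", ".join("?" * ncols), rows)

# An index on the id column is what gives the fast, repeatable lookups.
cur.execute("CREATE INDEX idx_players_id ON players (c0)")
conn.commit()

# Look up the player row for each of the first ten rankings, as in the question.
with open(MAINDIR + "atp_rankings_current.csv") as rf:
    for n, row in enumerate(csv.reader(rf)):
        if n >= 10:
            break
        player = cur.execute("SELECT * FROM players WHERE c0 = ?", (row[2],)).fetchone()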
Answer 2 (score: 0)
This code does not actually take very long to run, so I assume you are really going through more than just 10 of the rankings. When I go through all of them it does take a long time. If that is what you are interested in, a dictionary will shorten the search time. There is some overhead in setting the dictionary up, but after that you can search it very quickly. Here is how I would modify your for loop:
play_dict = {}
for index, player in enumerate(players):
    play_dict[player[0]] = index

for i in rankings[:10]:
    player = players[play_dict[i[2]]]
With this code you should be able to get through all of the rankings almost instantly.