Historically I've always used pandas because of its flexibility, its simplicity, and my background in R. And I really didn't think I was giving up that much in terms of speed... but doing some comparisons today left me scratching my head.
I use collections here and there, but whenever I do any kind of data science I naturally gravitate toward pandas because it feels like the safe move. But is pandas fast enough? I'm not sure... I suspect I've been sloppy in a lot of cases. These are big differences:
%%timeit
# pandas assumed to be imported elsewhere as pd
file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(name):
    return ' '.join(word for word in name.split() if word not in ['men', 'women'])

df['event'] = df.apply(lambda x: clean_event(x['event']), axis=1)
df.pivot_table(values='team', index='event', columns='athlete', aggfunc='nunique').sum().sort_values(ascending=False)
>>> 29.4 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
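Before pointing fingers, one way to see where those 29 ms actually go is to time each stage on its own. A rough sketch of a follow-up cell (my own, not from the original run; %%timeit doesn't persist variables, so df and clean_event get redefined here):

# Hypothetical profiling cell — assumes pandas imported as pd.
file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(name):
    return ' '.join(word for word in name.split() if word not in ['men', 'women'])

%timeit pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])  # I/O + parse
%timeit df.apply(lambda x: clean_event(x['event']), axis=1)  # row-wise apply
%timeit df.pivot_table(values='team', index='event', columns='athlete', aggfunc='nunique')  # reshape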
Now using collections:
%%timeit
# collections assumed to be imported elsewhere
file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
Medal = collections.namedtuple('medal', ['year', 'athlete', 'team', 'event'])
medals = [Medal(*line.strip().split('\t')) for line in open(file, 'r')]
d = collections.defaultdict(set)

def howmany(tup):
    return len(tup[1])

def clean_event(name):
    return ' '.join(word for word in name.split() if word not in ['men', 'women'])

for medal in medals:
    d[medal.athlete].add(clean_event(medal.event))

sorted(d.items(), key=howmany, reverse=True)
>>> 2.94 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Wow! Am I missing something important here? I'm pretty sure it's df.apply that's eating the processing power (clean_event isn't written in Cython, and even if it were, I think it would still have to drop back into Python to do the join).
Has anyone here run a similar A/B test, or can someone point me in a direction to dig further? I've picked up bits and pieces over time, but I'd love whatever wisdom more experienced folks have stumbled on...
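For what it's worth, the apply call can be sidestepped entirely with pandas' vectorized string methods. A sketch (I haven't benchmarked this variant, and the regex word-boundary match is close to, but not token-identical with, the split-and-filter in clean_event):

# Vectorized alternative to df.apply — one regex pass over the column.
df['event'] = (
    df['event']
    .str.replace(r'\b(?:men|women)\b', '', regex=True)  # drop the gender words
    .str.replace(r'\s+', ' ', regex=True)               # collapse leftover spaces
    .str.strip()
)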
Edit: changed clean_event() to a list comprehension. Still figuring out whether I should stick with pandas' pivot_table, which adds 13 ms.
%%timeit
from typing import List
file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(names: List[str]):
    return [' '.join(word for word in name.split() if word not in ['men', 'women']) for name in names]

df['event'] = clean_event(df['event'].values)
df.pivot_table(values='team', index='event', columns='athlete', aggfunc='nunique').sum().sort_values(ascending=False)
>>> 16.3 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Final edit: I got rid of the pivot table and used groupby(), which cuts out some unnecessary work. Getting closer.
%%timeit
from typing import List
file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(names: List[str]):
    return [' '.join(word for word in name.split() if word not in ['men', 'women']) for name in names]

df['event'] = clean_event(df['event'].values)
df.groupby('athlete').nunique().sort_values('event', ascending=False)['event']
>>> 7.67 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
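If I keep poking at it, one more thing to try is selecting the single column before aggregating, so nunique isn't computed for every column of the frame. An untested sketch:

# Aggregate only the 'event' column instead of the whole DataFrame,
# so 'year' and 'team' aren't counted and then thrown away.
df.groupby('athlete')['event'].nunique().sort_values(ascending=False)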
My final version is here: https://github.com/bejoinka/playground