Pandas vs collections: speed

Date: 2018-12-24 06:42:46

Tags: python python-3.x pandas

I've historically used pandas because of its flexibility, its simplicity, and my background in R. And I really didn't think I was giving up much in the way of speed... but I ran some a/b comparisons today for fun, and the results left me scratching my head.

I use collections a little, but whenever I'm doing any kind of data science I naturally reach for pandas, since that feels like the safe move. But is pandas fast enough? I'm not sure... I suspect I've been sloppy in a lot of cases. These are big differences:

%%timeit
import pandas as pd  # cached after the first loop, so it barely affects the timing

file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(name):
    # drop the 'men'/'women' qualifier so the same event counts once
    return ' '.join(word for word in name.split() if word not in ['men', 'women'])

# row-wise apply: builds a Series object for every row just to read one field
df['event'] = df.apply(lambda x: clean_event(x['event']), axis=1)
df.pivot_table(values='team', index='event', columns='athlete', aggfunc='nunique').sum().sort_values(ascending=False)

>>> 29.4 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

And now using collections:

%%timeit
import collections

file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
Medal = collections.namedtuple('Medal', ['year', 'athlete', 'team', 'event'])
with open(file, 'r') as f:
    medals = [Medal(*line.strip().split('\t')) for line in f]

# athlete -> set of distinct (cleaned) events
d = collections.defaultdict(set)

def howmany(tup):
    # sort key: the size of the athlete's event set
    return len(tup[1])

def clean_event(name):
    return ' '.join(word for word in name.split() if word not in ['men', 'women'])

for medal in medals:
    d[medal.athlete].add(clean_event(medal.event))
sorted(d.items(), key=howmany, reverse=True)

>>> 2.94 ms ± 40.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Wow! Am I missing something important here? I'm fairly sure it's df.apply that's eating the processing time (clean_event isn't written in Cython, and even if it were, I think it would still have to drop back into Python to do the join).
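If it really is apply, one quick way to isolate it would be something like the sketch below (my own addition, not one of the runs above, so the numbers aren't verified): it reuses the same file and helper and compares row-wise apply against Series.map, which feeds each value to clean_event without materializing a row object.

import timeit
import pandas as pd

file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(name):
    return ' '.join(word for word in name.split() if word not in ['men', 'women'])

# row-wise: pandas constructs a Series for every row
print(timeit.timeit(lambda: df.apply(lambda x: clean_event(x['event']), axis=1), number=100))
# column-wise: map() passes raw values straight to clean_event
print(timeit.timeit(lambda: df['event'].map(clean_event), number=100))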

Has anyone here run a similar a/b test, or can anyone point me toward something to dig into further? I've picked up scattered bits of information over time, but I'd love any wisdom someone more experienced has stumbled upon...

EDIT: Changed clean_event() to a list comprehension. Still figuring out whether I should stick with pandas' pivot_table, which adds about 13 ms.

%%timeit
import pandas as pd
from typing import List

file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(names: List[str]) -> List[str]:
    # one pass over the raw array of strings, no per-row Series
    return [' '.join(word for word in name.split() if word not in ['men', 'women']) for name in names]

df['event'] = clean_event(df['event'].values)
df.pivot_table(values='team', index='event', columns='athlete', aggfunc='nunique').sum().sort_values(ascending=False)
>>> 16.3 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

FINAL EDIT: I got rid of the pivot table and used groupby(), which cut out some unnecessary work. That gets us closer.

%%timeit
import pandas as pd
from typing import List

file = "Ex_Files_Python_Efficiently/Exercise Files/chapter2/02_05/goldmedals.txt"
df = pd.read_table(file, header=None, names=['year', 'athlete', 'team', 'event'])

def clean_event(names: List[str]) -> List[str]:
    return [' '.join(word for word in name.split() if word not in ['men', 'women']) for name in names]

df['event'] = clean_event(df['event'].values)
# groupby + nunique replaces the whole pivot/sum/sort pipeline
df.groupby('athlete').nunique().sort_values('event', ascending=False)['event']
>>> 7.67 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
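One more trim I haven't timed yet (so treat this as a sketch, not a measured result): nunique() above runs over every column of the frame before I slice out 'event', so selecting the column first should skip that wasted work. Reusing df from the cell above:

# aggregate only the 'event' column instead of the whole frame
df.groupby('athlete')['event'].nunique().sort_values(ascending=False)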

My final version of this is up here: https://github.com/bejoinka/playground

0 Answers:

No answers yet.