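Python: fast subsetting and looping over a DataFrame

Asked: 2016-06-12 09:18:25

Tags: python python-2.7 loops numpy pandas

I have the following minimal code, and it is too slow. For the 1000 rows I need, it takes about 2 minutes. I need it to run faster.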

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))
start_algorithm = time.time()
myunique = df['D'].unique()
for i in myunique:
    itemp = df[df['D'] == i]
    for j in myunique:
        jtemp = df[df['D'] == j]

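I know numpy could make this run faster, but keep in mind that I want to keep the part of the original DataFrame (or of the array, in numpy) that corresponds to a specific value of column 'D'. How can I improve the performance?

2 answers:

Answer 0 (score: 4)

Avoid computing the sub-DataFrame df[df['D'] == i] more than once. The original code computes it len(myunique)**2 times. Instead, you can compute it once for each i (that is, len(myunique) times in total), store the results, and then pair them up afterwards. For example,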

import itertools as IT

groups = [grp for di, grp in df.groupby('D')]
for itemp, jtemp in IT.product(groups, repeat=2):
    pass

A timing comparison of the two approaches:

import itertools as IT
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,1000,size=(1000, 4)), columns=list('ABCD'))

def using_orig():
    myunique = df['D'].unique()
    for i in myunique:
        itemp = df[df['D'] == i]
        for j in myunique:
            jtemp = df[df['D'] == j]

def using_groupby():
    groups = [grp for di, grp in df.groupby('D')]
    for itemp, jtemp in IT.product(groups, repeat=2):
        pass
In [28]: %timeit using_groupby()
10 loops, best of 3: 63.8 ms per loop
In [31]: %timeit using_orig()
1 loop, best of 3: 2min 22s per loop
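
If the loop body also needs the value of 'D' that each group belongs to, the groups can be stored in a dict keyed by that value. The following is a minimal sketch rather than part of the original answer; the names `groups_by_key` and `n_pairs` are illustrative, and the fixed seed is an assumption added only to make the run repeatable.

```python
import itertools as IT

import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed, only for repeatability
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 4)), columns=list('ABCD'))

# Build each sub-DataFrame exactly once, keyed by its value of 'D'.
groups_by_key = {key: grp for key, grp in df.groupby('D')}

# Each stored group holds the same rows as the boolean-mask subset.
for key, grp in groups_by_key.items():
    assert grp.equals(df[df['D'] == key])

# Pair the stored groups; df is not filtered again inside the loop.
n_pairs = 0
for (i, itemp), (j, jtemp) in IT.product(groups_by_key.items(), repeat=2):
    n_pairs += 1

print(n_pairs == len(groups_by_key) ** 2)
```

The number of pairs is still quadratic in the number of unique values; what this removes is the repeated boolean filtering of the full DataFrame inside the inner loop.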

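Regarding the comment:

  I could just as easily replace itemp and jtemp with a = 1, or print "Hello", so ignore

The answer above addresses how to compute itemp and jtemp more efficiently. If itemp and jtemp are not central to your actual computation, then we need a better understanding of what you really want to compute in order to suggest (if possible) a faster way to compute it.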
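
To illustrate why the actual computation matters, here is a hedged sketch with a purely hypothetical per-pair quantity: the mean of 'A' in group i minus the mean of 'A' in group j. Nothing in the question says this is the real goal; it only shows that when the pairwise result depends on per-group summaries, the double loop can sometimes be replaced by one groupby followed by NumPy broadcasting. The names `group_means` and `pairwise` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed, only for repeatability
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 4)), columns=list('ABCD'))

# Hypothetical per-group summary: mean of column 'A' for each unique 'D'.
group_means = df.groupby('D')['A'].mean()

# Broadcasting builds the full (n_groups x n_groups) table of differences
# without a Python-level loop over pairs.
values = group_means.values
pairwise = pd.DataFrame(
    values[:, None] - values[None, :],
    index=group_means.index,
    columns=group_means.index,
)

print(pairwise.shape)
```

Whether a shortcut like this applies depends entirely on what is done with itemp and jtemp inside the loop.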

Answer 1 (score: 1)

Here is a vectorized approach that forms the groups based on the unique elements in column "D":

# Sort the dataframe based on the sorted indices of column 'D'
df_sorted = df.iloc[df['D'].argsort()]

# In the sorted dataframe's 'D' column find the shift/cut indices
# (places where elements change values, indicating change of groups).
# Cut the dataframe at those indices for the final groups with NumPy Split.
cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
df_split = np.split(df_sorted,cut_idx)
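
Since the question also mentions working with a NumPy array, the same sort-and-cut idea can be sketched on a plain 2-D array. This is an illustrative variant, not the answerer's code: it takes the cut points from `np.unique(..., return_index=True)` instead of `np.diff`, and the names `arr`, `order`, `first_idx`, and `groups` are assumptions.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for repeatability
# Columns 0..3 play the roles of 'A', 'B', 'C', 'D'.
arr = np.random.randint(0, 1000, size=(1000, 4))

# Sort the rows by the last column; a stable sort keeps the original
# row order within each group.
order = np.argsort(arr[:, 3], kind='mergesort')
arr_sorted = arr[order]

# For sorted input, return_index gives the position of the first row
# of each group.
_, first_idx = np.unique(arr_sorted[:, 3], return_index=True)

# Cut at the start of every group except the first.
groups = np.split(arr_sorted, first_idx[1:])

print(len(groups))
```

Each element of `groups` is a 2-D array whose last column holds a single value.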

Sample run

1] Create a sample DataFrame with random elements:

>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)), columns=list('ABCD'))
>>> df
    A   B   C   D
0  68  68  90  39
1  53  99  20  85
2  64  76  21  19
3  90  91  32  36
4  24   9  89  19

2] Run the original code and print the results:

>>> myunique = df['D'].unique()
>>> for i in myunique:
...     itemp = df[df['D'] == i]
...     print itemp
... 
    A   B   C   D
0  68  68  90  39
    A   B   C   D
1  53  99  20  85
    A   B   C   D
2  64  76  21  19
4  24   9  89  19
    A   B   C   D
3  90  91  32  36

3] Run the proposed code and print the results:

>>> df_sorted = df.iloc[df['D'].argsort()]
>>> cut_idx = np.where(np.diff(df_sorted['D'])>0)[0]+1
>>> df_split = np.split(df_sorted,cut_idx)
>>> for split in df_split:
...     print split
... 
    A   B   C   D
2  64  76  21  19
4  24   9  89  19
    A   B   C   D
3  90  91  32  36
    A   B   C   D
0  68  68  90  39
    A   B   C   D
1  53  99  20  85
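
The two runs above return the same groups in a different order: the original loop follows the order in which values first appear in 'D', while the sort-and-split approach follows sorted order. If the first-appearance order matters, one option is `groupby` with `sort=False`. This is a small sketch on the sample data shown above, not part of the original answer; `loop_groups` and `grouped` are illustrative names.

```python
import pandas as pd

# The 5-row sample DataFrame from step 1.
df = pd.DataFrame(
    [[68, 68, 90, 39],
     [53, 99, 20, 85],
     [64, 76, 21, 19],
     [90, 91, 32, 36],
     [24, 9, 89, 19]],
    columns=list('ABCD'))

# Groups as produced by the original loop (order of first appearance).
loop_groups = [df[df['D'] == i] for i in df['D'].unique()]

# sort=False makes groupby keep the groups in order of first appearance.
grouped = [grp for _, grp in df.groupby('D', sort=False)]

same = all(a.equals(b) for a, b in zip(loop_groups, grouped))
print(same)                                    # True
print([int(g['D'].iloc[0]) for g in grouped])  # [39, 85, 19, 36]
```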