How can I optimize/vectorize this looped assignment on a DataFrame?

Asked: 2014-07-07 15:29:49

Tags: python optimization pandas vectorization

Below is a function I wrote to label certain rows based on index ranges. For convenience, I have made the two function arguments, `samples` and `matdat`, available for download in pickle format.

from operator import itemgetter
from itertools import izip, imap

import pandas as pd


def _insert_design_columns(samples, matdat):
    """Add columns for design-factors, label lines that correspond to a given trials and
    then fill in said columns with the appropriate value on lines that belong to a
    trial.

    samples : DataFrame
         DataFrame of eyetracker samples.
         column `t`: time sample, in ms
         column `event`: TTL event
         columns `x`, `y`: x and y coordinates of gaze
         column `cr`: corneal reflection area

    matdat : dict of numpy arrays
         dict mapping matlab variable name to numpy array

    returns : modified `samples` dataframe
    """
    ## This is fairly trivial preparation and data formatting for the nested
    #    for-loop below.  We're just fixing types, adding empty columns, and
    #    ensuring that our numpy arrays have the right shape.

    # Grab variables from the dict & squeeze the numpy arrays
    key = ('cuepos', 'targetpos', 'targetorientation', 'soa', 'normalizedResp')
    cpos, tpos, torient, soa, resp = map(pd.np.squeeze, imap(matdat.get, key))
    cpos = cpos.astype(float)
    cpos[cpos < 0] = pd.np.nan
    cong = (tpos == cpos).astype(float)  # cast to float so the array can hold NaN
    cong[pd.isnull(cpos)] = pd.np.nan    # congruence is undefined when cue pos is missing

    # Add empty columns for each factor.  These will contain the factor level
    # that corresponds to a trial (i.e. rows between a `TrialStart` and a
    # `ReportCueOnset` in `samples.event`).
    samples['soa'] = pd.np.nan
    samples['cpos'] = pd.np.nan
    samples['tpos'] = pd.np.nan
    samples['cong'] = pd.np.nan
    samples['torient'] = pd.np.nan
    samples['resp'] = pd.np.nan  # 'normalizedResp' in matdat; written as 'resp' below

    ## This is important, but not the part we need to optimize.
    #     Here, we're finding the start and end indexes for every trial.  Trials
    #     are composed of continuous slices of rows.

    # Assign trial numbers
    tstart = samples[samples.event == 'TrialStart'].t  # each trial starts on a `TrialStart`
    tstop = samples[samples.event == 'ReportCueOnset'].t  # ... and ends on a `ReportCueOnset`
    samples['trial'] = pd.np.nan  # make an empty column which will contain trial num

    ## This is the sub-optimal part.  Here, we're iterating through our start/end index
    #    pairs, slicing the dataframe to get the rows we need, and then:
    #       1.  Assigning a trial number to that slice of rows
    #       2.  Assigning the correct value to corresponding columns (see `factor_names`)

    samples.set_index(['t'], inplace=True)
    for i, (start, stop) in enumerate(izip(tstart, tstop)):
        samples.loc[start:stop, 'trial'] = i + 1  # label the interval's trial number

        # Now that we've labeled a range of rows as a trial, we can add factor levels
        # to the corresponding columns
        idx = itemgetter(i)  # trial number is i + 1, so its factor levels live at 0-based index i
        # Each factor array (cpos, tpos, ...) has one entry per trial.  Grab
        # the entry for the current trial so that we can assign it to the
        # corresponding column.
        factor_values = imap(idx, (cpos, tpos, torient, soa, resp, cong))
        factor_names = ('cpos', 'tpos', 'torient', 'soa', 'resp', 'cong')
        for c, v in izip(factor_names, factor_values):  # loop through columns and assign
            samples.loc[start:stop, c] = v
    samples.reset_index(inplace=True)

    return samples

I ran `%prun` on it; the first few lines of the output are:

         548568 function calls (547462 primitive calls) in 9.380 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    11360    6.074    0.001    6.084    0.001 index.py:604(__contains__)
     2194    0.949    0.000    0.949    0.000 {method 'copy' of 'numpy.ndarray' objects}
     1430    0.730    0.001    0.730    0.001 {pandas.lib.infer_dtype}
     1098    0.464    0.000    0.467    0.000 internals.py:277(set)
1093/1092    0.142    0.000    9.162    0.008 indexing.py:157(_setitem_with_indexer)
     1100    0.106    0.000    1.266    0.001 frame.py:1851(__setitem__)
      166    0.047    0.000    0.047    0.000 {method 'astype' of 'numpy.ndarray' objects}
   107209    0.037    0.000    0.066    0.000 {isinstance}
       14    0.029    0.002    0.029    0.002 {numpy.core.multiarray.concatenate}
39362/38266    0.026    0.000    6.101    0.000 {getattr}
7829/7828    0.024    0.000    0.030    0.000 {numpy.core.multiarray.array}
     1092    0.023    0.000    0.457    0.000 internals.py:564(setitem)
        5    0.023    0.005    0.023    0.005 {pandas.algos.take_2d_axis0_float64_float64}
     4379    0.021    0.000    0.108    0.000 index.py:615(__getitem__)
     1101    0.020    0.000    0.582    0.001 frame.py:1967(_sanitize_column)
     2192    0.017    0.000    0.946    0.000 internals.py:2236(apply)
        8    0.017    0.002    0.017    0.002 {method 'repeat' of 'numpy.ndarray' objects}

Judging from the line `1093/1092 0.142 0.000 9.162 0.008 indexing.py:157(_setitem_with_indexer)`, I strongly suspect that the `loc` assignments in my nested loop are the culprit. The whole function takes about 9.3 seconds to execute, and it has to run 144 times in total (i.e. ~22 minutes).
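For reference, here is a minimal, untested sketch of how the trial-numbering loop alone might be vectorized with `np.searchsorted`. It assumes the trial intervals are sorted by time and non-overlapping, and it would run before the `set_index` call, while `t` is still a column:

import numpy as np

# Label each sample's trial number in one pass instead of one .loc write
# per trial (assumes sorted, non-overlapping trial intervals).
t = samples['t'].values
starts = tstart.values  # times of the 'TrialStart' events
stops = tstop.values    # times of the 'ReportCueOnset' events

# For each sample, the index of the latest trial start at or before it.
ix = np.searchsorted(starts, t, side='right') - 1

# The sample belongs to that trial only if it also precedes the trial's stop.
in_trial = (ix >= 0) & (t <= stops[ix.clip(0)])
samples['trial'] = np.where(in_trial, ix + 1, np.nan)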

Is there a way to vectorize or optimize what I'm trying to do?
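Similarly, once every row carries a trial number, it seems the per-trial factor columns could be filled with a single `merge` against a trial-level design table instead of per-trial `.loc` writes. A sketch under the same assumptions (the squeezed `matdat` arrays are ordered by trial; this would also replace the block that pre-creates the empty NaN columns, since `merge` creates them):

import numpy as np
import pandas as pd

# One row of factor levels per trial, merged onto the samples in one pass.
# 'trial' is float here so its dtype matches the NaN-bearing samples['trial'].
design = pd.DataFrame({
    'trial': np.arange(1, len(tstart) + 1, dtype=float),
    'cpos': cpos, 'tpos': tpos, 'torient': torient,
    'soa': soa, 'resp': resp, 'cong': cong,
})
samples = samples.merge(design, on='trial', how='left')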

0 Answers

No answers yet.