Below is a function I wrote to label certain rows based on index ranges. For convenience, I've made the two function arguments, `samples` and `matdat`, available for download in pickle format.
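For reference, this is roughly how the two pickles would be loaded back (the file names are placeholders for the actual downloads):

import pickle
import pandas as pd

# Placeholder file names -- substitute the actual pickle downloads.
samples = pd.read_pickle('samples.pkl')   # DataFrame of eyetracker samples
with open('matdat.pkl', 'rb') as f:
    matdat = pickle.load(f)               # dict of numpy arrays

The function itself: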
from operator import itemgetter
from itertools import izip, imap
import pandas as pd
def _insert_design_columns(samples, matdat):
"""Add columns for design-factors, label lines that correspond to a given trials and
then fill in said columns with the appropriate value on lines that belong to a
trial.
samples : DataFrame
DataFrame of eyetracker samples.
column `t`: time sample, in ms
column `event`: TTL event
columns x, y: x and y coordinates of gaze
column cr: corneal reflection area
matdat : dict of numpy arrays
dict mapping matlab variable name to numpy array
returns : modified `samples` dataframe
"""
    ## This is fairly trivial preparation and data formatting for the nested
# for-loop below. We're just fixing types, adding empty columns, and
# ensuring that our numpy arrays have the right shape.
# Grab variables from the dict & squeeze the numpy arrays
key = ('cuepos', 'targetpos', 'targetorientation', 'soa', 'normalizedResp')
cpos, tpos, torient, soa, resp = map(pd.np.squeeze, imap(matdat.get, key))
cpos = cpos.astype(float)
cpos[cpos < 0] = pd.np.nan
    cong = (tpos == cpos).astype(float)  # cast to float so NaN can be stored
    cong[pd.isnull(cpos)] = pd.np.nan    # congruence is undefined when there is no cue
    # Add empty columns for each factor. These will contain the factor level
    # on rows that correspond to a trial (i.e. between a `TrialStart` and a
    # `ReportCueOnset` in `samples.event`).
samples['soa'] = pd.np.nan
samples['cpos'] = pd.np.nan
samples['tpos'] = pd.np.nan
samples['cong'] = pd.np.nan
samples['torient'] = pd.np.nan
    samples['resp'] = pd.np.nan  # must match the names in `factor_names` below
## This is important, but not the part we need to optimize.
# Here, we're finding the start and end indexes for every trial. Trials
# are composed of continuous slices of rows.
# Assign trial numbers
tstart = samples[samples.event == 'TrialStart'].t # each trial starts on a `TrialStart`
tstop = samples[samples.event == 'ReportCueOnset'].t # ... and ends on a `ReportCueOnset`
samples['trial'] = pd.np.nan # make an empty column which will contain trial num
## This is the sub-optimal part. Here, we're iterating through our start/end index
# pairs, slicing the dataframe to get the rows we need, and then:
# 1. Assigning a trial number to that slice of rows
# 2. Assigning the correct value to corresponding columns (see `factor_names`)
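    # Index by timestamp so that `.loc[start:stop]` slices each trial's rows
    # by time (label-based slicing, inclusive of both endpoints).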
samples.set_index(['t'], inplace=True)
for i, (start, stop) in enumerate(izip(tstart, tstop)):
samples.loc[start:stop, 'trial'] = i + 1 # label the interval's trial number
# Now that we've labeled a range of rows as a trial, we can add factor levels
# to the corresponding columns
        idx = itemgetter(i)  # trial `i + 1` lives at index `i` of each factor array
        # Each factor array (`cpos`, `tpos`, ...) holds one value per trial.
        # Grab the value for the current trial so that we can assign it.
factor_values = imap(idx, (cpos, tpos, torient, soa, resp, cong))
factor_names = ('cpos', 'tpos', 'torient', 'soa', 'resp', 'cong')
for c, v in izip(factor_names, factor_values): # loop through columns and assign
samples.loc[start:stop, c] = v
samples.reset_index(inplace=True)
return samples
I've run %prun on it, and the first few lines of the output are:
548568 function calls (547462 primitive calls) in 9.380 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
11360 6.074 0.001 6.084 0.001 index.py:604(__contains__)
2194 0.949 0.000 0.949 0.000 {method 'copy' of 'numpy.ndarray' objects}
1430 0.730 0.001 0.730 0.001 {pandas.lib.infer_dtype}
1098 0.464 0.000 0.467 0.000 internals.py:277(set)
1093/1092 0.142 0.000 9.162 0.008 indexing.py:157(_setitem_with_indexer)
1100 0.106 0.000 1.266 0.001 frame.py:1851(__setitem__)
166 0.047 0.000 0.047 0.000 {method 'astype' of 'numpy.ndarray' objects}
107209 0.037 0.000 0.066 0.000 {isinstance}
14 0.029 0.002 0.029 0.002 {numpy.core.multiarray.concatenate}
39362/38266 0.026 0.000 6.101 0.000 {getattr}
7829/7828 0.024 0.000 0.030 0.000 {numpy.core.multiarray.array}
1092 0.023 0.000 0.457 0.000 internals.py:564(setitem)
5 0.023 0.005 0.023 0.005 {pandas.algos.take_2d_axis0_float64_float64}
4379 0.021 0.000 0.108 0.000 index.py:615(__getitem__)
1101 0.020 0.000 0.582 0.001 frame.py:1967(_sanitize_column)
2192 0.017 0.000 0.946 0.000 internals.py:2236(apply)
8 0.017 0.002 0.017 0.002 {method 'repeat' of 'numpy.ndarray' objects}
Judging from the line

 1093/1092    0.142    0.000    9.162    0.008 indexing.py:157(_setitem_with_indexer)

I strongly suspect that the `loc` assignments in my nested loop are the culprit. The whole function takes about 9.3 seconds to execute, and it has to run 144 times in total (i.e. ~22 minutes).
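A minimal reproduction along these lines (the sizes are made up and only meant to mimic the shape of my data) should exercise the same `_setitem_with_indexer` path:

import numpy as np
import pandas as pd

# Made-up sizes, only meant to mimic the shape of my data.
n_rows, n_trials = 200000, 180
df = pd.DataFrame({'x': np.zeros(n_rows)}, index=np.arange(n_rows))
starts = np.linspace(0, n_rows - 1000, n_trials).astype(int)
stops = starts + 900

# One `.loc` slice-assignment per trial -- the same pattern as my nested loop.
for i, (start, stop) in enumerate(zip(starts, stops)):
    df.loc[start:stop, 'x'] = i + 1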
Is there a way to vectorize or otherwise optimize the task I'm attempting?
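For instance, I wondered whether something along the following lines could replace the loop entirely. This is an untested sketch that assumes trial intervals never overlap: it labels every row's trial with `np.searchsorted`, then fills all factor columns with a single merge instead of the empty-column setup above:

import numpy as np
import pandas as pd

# Untested sketch -- assumes trial intervals never overlap or nest.
ts = samples['t'].values
trial = np.searchsorted(tstart.values, ts, side='right')  # count of starts <= t
in_trial = ts <= tstop.values[np.clip(trial - 1, 0, len(tstop) - 1)]
samples['trial'] = np.where((trial > 0) & in_trial, trial, np.nan)

# One row of factor levels per trial, merged onto the samples in one shot.
factors = pd.DataFrame({'trial': np.arange(1, len(tstart) + 1, dtype=float),
                        'cpos': cpos, 'tpos': tpos, 'torient': torient,
                        'soa': soa, 'resp': resp, 'cong': cong})
samples = samples.merge(factors, on='trial', how='left')

I haven't verified that this produces identical output, so I'd also welcome corrections to the approach itself.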