我将CSV中的数据加载到Pandas中并对某些字段进行验证,如下所示:
(1.5s) loans['net_mortgage_margin'] = loans['net_mortgage_margin'].map(lambda x: convert_to_decimal(x))
(1.5s) loans['current_interest_rate'] = loans['current_interest_rate'].map(lambda x: convert_to_decimal(x))
(1.5s) loans['net_maximum_interest_rate'] = loans['net_maximum_interest_rate'].map(lambda x: convert_to_decimal(x))
(48s) loans['credit_score'] = loans.apply(lambda row: get_minimum_score(row), axis=1)
(< 1s) loans['loan_age'] = ((loans['factor_date'] - loans['first_payment_date']) / np.timedelta64(+1, 'M')).round() + 1
(< 1s) loans['months_to_roll'] = ((loans['next_rate_change_date'] - loans['factor_date']) / np.timedelta64(+1, 'M')).round() + 1
(34s) loans['first_payment_change_date'] = loans.apply(lambda x: validate_date(x, 'first_payment_change_date', loans.columns), axis=1)
(37s) loans['first_rate_change_date'] = loans.apply(lambda x: validate_date(x, 'first_rate_change_date', loans.columns), axis=1)
(39s) loans['first_payment_date'] = loans.apply(lambda x: validate_date(x, 'first_payment_date', loans.columns), axis=1)
(39s) loans['maturity_date'] = loans.apply(lambda x: validate_date(x, 'maturity_date', loans.columns), axis=1)
(37s) loans['next_rate_change_date'] = loans.apply(lambda x: validate_date(x, 'next_rate_change_date', loans.columns), axis=1)
(36s) loans['first_PI_date'] = loans.apply(lambda x: validate_date(x, 'first_PI_date', loans.columns), axis=1)
(36s) loans['servicer_name'] = loans.apply(lambda row: row['servicer_name'][:40].upper().strip(), axis=1)
(38s) loans['state_name'] = loans.apply(lambda row: str(us.states.lookup(row['state_code'])), axis=1)
(33s) loans['occupancy_status'] = loans.apply(lambda row: get_occupancy_type(row), axis=1)
(37s) loans['original_interest_rate_range'] = loans.apply(lambda row: get_interest_rate_range(row, 'original'), axis=1)
(36s) loans['current_interest_rate_range'] = loans.apply(lambda row: get_interest_rate_range(row, 'current'), axis=1)
(33s) loans['valid_credit_score'] = loans.apply(lambda row: validate_credit_score(row), axis=1)
(60s) loans['origination_year'] = loans['first_payment_date'].map(lambda x: x.year if x.month > 2 else x.year - 1)
(< 1s) loans['number_of_units'] = loans['unit_count'].map(lambda x: '1' if x == 1 else '2-4')
(32s) loans['property_type'] = loans.apply(lambda row: validate_property_type(row), axis=1)
大多数是找到row
值的函数,有些函数直接将元素转换为其他元素,但总而言之,这些函数是逐行运行的。编写此代码时,数据框足够小,这不是问题。但是,代码现在适用于占用更大的表,因此代码的这一部分需要花费太长时间。
优化此功能的最佳方法是什么?我的第一个想法是逐行,但在行上应用所有这些函数/转换(即对于df中的行,执行func1,func2,...,func21),但我不确定是否这是解决这个问题的最佳方法。有没有办法避免lambda得到相同的结果,例如,因为我认为它的lambda需要很长时间?在重要的情况下运行Python 2.7。
编辑:这些调用中的大多数以每行大约相同的速率运行(少数几个非常快)。这是一个包含277,659行的数据帧,在大小方面为80%。
Edit2:函数示例:
def validate_date(row, date_type, cols):
date_element = row[date_type]
if date_type not in cols:
return np.nan
if pd.isnull(date_element) or len(str(date_element).strip()) < 2: # can be blank, NaN, or "0"
return np.nan
if date_element.day == 1:
return date_element
else:
next_month = date_element + relativedelta(months=1)
return pd.to_datetime(dt.date(next_month.year, next_month.month, 1))
这类似于从日期对象(年,月等)中提取值的最长调用(origination_year)。其他的,例如property_type,只是检查不规则值(例如&#34; N / A&#34;,&#34; NULL&#34;等等)但是仍需要一段时间才能查看每一个值
答案 0 :(得分:0)
td; lr:考虑分发处理。改进是以块的形式读取数据并使用多个进程。来源http://gouthamanbalaraman.com/blog/distributed-processing-pandas.html
import multiprocessing as mp
def process_frame(df): len(x)
if __name__ == "__main__":
reader = read_csv(csv-file, chunk_size=CHUNKSIZE)
pool = mp.Pool(4) # use 4 processes
funclist = []
for df in reader:
# process each data frame
f = pool.apply_async(process_frame,[df])
funclist.append(f)
result = 0
for f in funclist:
result += f.get(timeout=10) # timeout in 10 seconds
print "There are %d rows of data"%(result)
另一种选择可能是使用GNU parallel。 这是使用GNU parallel
的另一个好例子