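In machine learning you often have large datasets that must be preprocessed differently depending on the algorithm you apply later. How would you write a function that remembers a particular preprocessing pipeline and then loads the results directly instead of recomputing them?
Here is a small code example to help show what I mean: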
import numpy as np
import pickle

def f(data, scaling=None, reduction=None):
    # here the function should check if it already has been called with the inputted keywords.
    # If so it just has to load the results from that call from the hard drive and exit the function call

    # data processing section
    if scaling == 'standard':
        # do scaling stuff
        pass
    if scaling == 'min_max':
        # do other scaling stuff
        pass
    if reduction == 'PCA':
        # do reduction stuff
        pass
    if reduction == 'ICA':
        # do other reduction stuff
        pass

    # saving results on hard drive
    with open('anypath', 'wb') as file:
        pickle.dump(data, file)
    return data

data = np.random.randint(100, size=(100,5))
config = {'scaling':'standard',
          'reduction':'ICA'}
data_processed = f(data, **config)
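
One possible approach, shown here only as a minimal sketch, is a decorator that hashes the keyword arguments together with the input data and uses that hash as the file name of the pickled result. The names `disk_cached`, `CACHE_DIR` and `calls` are hypothetical and not part of the code above. The demo uses plain nested lists so that it runs with the standard library alone; a NumPy array should work the same way because it can be pickled, assuming equal arrays pickle to identical bytes.

```python
import functools
import hashlib
import json
import pickle
import tempfile
from pathlib import Path

# Hypothetical cache location. A fresh temp directory keeps the demo clean;
# point this at a permanent folder to keep results between runs.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="preprocessing_cache_"))


def disk_cached(func):
    """Cache func(data, **config) on disk, keyed by the data and the keywords."""

    @functools.wraps(func)
    def wrapper(data, **config):
        # Build the key from the function name, the sorted keywords and the data bytes.
        # Assumes the keyword values are JSON-serializable (strings, numbers, None).
        hasher = hashlib.sha256()
        hasher.update(func.__name__.encode())
        hasher.update(json.dumps(config, sort_keys=True).encode())
        hasher.update(pickle.dumps(data))
        cache_file = CACHE_DIR / f"{hasher.hexdigest()}.pkl"

        # Cache hit: load the stored result and skip the computation.
        if cache_file.exists():
            with open(cache_file, "rb") as file:
                return pickle.load(file)

        # Cache miss: compute, store, return.
        result = func(data, **config)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        with open(cache_file, "wb") as file:
            pickle.dump(result, file)
        return result

    return wrapper


calls = []  # records real computations so cache hits are visible


@disk_cached
def f(data, scaling=None, reduction=None):
    calls.append((scaling, reduction))
    if scaling == "min_max":
        lo = min(min(row) for row in data)
        hi = max(max(row) for row in data)
        data = [[(x - lo) / (hi - lo) for x in row] for row in data]
    return data


data = [[1, 2], [3, 5]]
config = {"scaling": "min_max", "reduction": None}
first = f(data, **config)   # computed and written to disk
second = f(data, **config)  # loaded from disk, the body of f is not run again
```

In this sketch the wrapper only accepts the configuration as keywords, and `f(data)` and `f(data, scaling=None)` produce different cache keys because defaults are not filled in before hashing. Pickle files should only be loaded from a location you trust. The third-party library joblib provides a ready-made disk cache along the same lines (`joblib.Memory`), which may be worth a look before rolling your own.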