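I'm looking to use the PCA module in sklearn to do real-time transformation of user features in our app (for a spam detection algorithm). The lead on our team has specified that the model needs to make predictions in under 100 ms (preferably 10-50 ms). So far, in my timing tests, I'm getting about 150 ms. This is on a really small amount of data (i.e. the smallest amount I could imagine using in a real application). Has anybody run into this problem, and if so, what can I do about it? At some point, do I need to take the plunge and roll my own PCA? My first thought is that if sklearn can't do it that fast, I can't imagine my naive code would do it any faster. Is Python simply too limited performance-wise for a real-time application like this?
To build the model, a set of transformers is combined in a FeatureUnion: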
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

FeatureUnion(
    transformer_list=[
        ('age', Pipeline([
            ('selector', ColumnSelector(column='birthdate')),
            ('birthdate', BirthdateTransformer()),
            ('vect', DictVectorizer(sparse=False))
        ])),
        ('domain', Pipeline([
            ('selector', ColumnSelector(column='email')),
            ('counter', EmailDomainName()),
            ('vect', DictVectorizer(sparse=False))
        ])),
        ('sex', Pipeline([
            ('selector', ColumnSelector(column='sex')),
            ('dict', DictTransformer()),
            ('vect', DictVectorizer(sparse=False))
        ]))
    ],
    n_jobs=2,
    transformer_weights=weights)
The only custom classes are BirthdateTransformer and EmailDomainName:
from sklearn.base import TransformerMixin

class BirthdateTransformer(TransformerMixin):
    """
    takes list of Unix timestamps (seconds since epoch) and converts to list of ages in years
    """
    def fit(self, dates, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, timestamps, y=None):
        # age_group() is a helper defined elsewhere (not shown here)
        return [{'age group': age_group(ts / 1000)} for ts in timestamps]
import re
from collections import Counter

class EmailDomainName(TransformerMixin):
    """
    Class for building sklearn Pipeline step. This class takes a list of email addresses and returns a list of dicts
    created by the Counter class from the collections module.
    """
    regex = re.compile(r"@[\w.]+")

    def pre_filter(self, email_list):
        # Yield the "@domain" part of each address, or "" when nothing matches
        for e in email_list:
            match = self.regex.search(e)
            if match is not None:
                yield match.group()
            else:
                yield ""

    def fit(self, x, y=None):
        # Stateless transformer: nothing to learn
        return self

    def transform(self, email_list):
        """
        Use regular expression to pull out domain name from email address and convert to list of dicts containing
        count of characters.
        :param email_list:
        :return:
        """
        return [{k: v for (k, v) in Counter(result).most_common()} for result in
                self.pre_filter(email_list)]
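To make the EmailDomainName output concrete, here is a minimal standalone sketch of what its transform step computes for a single address. The helper name `domain_char_counts` and the sample addresses are made up for illustration; only the regex and the Counter logic come from the class.

```python
import re
from collections import Counter

# Same pattern as EmailDomainName.regex
regex = re.compile(r"@[\w.]+")

def domain_char_counts(email):
    """Return per-character counts of the '@domain' part, or {} if nothing matches."""
    match = regex.search(email)
    domain = match.group() if match is not None else ""
    return dict(Counter(domain).most_common())

# Made-up inputs, for illustration only
sample = domain_char_counts("user@example.com")
empty = domain_char_counts("not-an-email")
print(sample)
print(empty)
```

Each dict then goes to DictVectorizer, so every distinct character seen in a domain becomes one feature column.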
Then I simply pass in a Pandas DataFrame with the data and call transform() on the data to be projected into PCA space.
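For reference, a minimal sketch of the kind of timing test I mean, using only the standard library. `time_call` and `fake_transform` are hypothetical names; the stand-in function would be replaced with the real transform() call on the DataFrame.

```python
import statistics
import time

def time_call(fn, *args, repeats=20):
    """Call fn(*args) `repeats` times and return the median latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Hypothetical stand-in for the real model.transform(df) call
def fake_transform(rows):
    return [len(r) for r in rows]

median_ms = time_call(fake_transform, ["a@b.com"] * 10)
print(f"median latency: {median_ms:.3f} ms")
```

Taking the median over repeated calls keeps a single slow first call from dominating the figure, and timing each pipeline step the same way would show which one accounts for most of the 150 ms.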