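I have created several Naive Bayes classifiers and have managed to save and load them with pickle; the code works correctly.
My classifiers need to be run by someone else on another machine, so if I send them the code and the pickle file, I want to be sure it will work for them. The part I'm mainly concerned about is saving and loading the pickle, but I'm open to suggestions if anything else looks bad or inefficient.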
import nltk
import pickle
import pandas
raw_data = {
    'first_name': ['Zekey', 'Josh', 'Jake', 'Tina', 'Amy', 'Matt', 'Mitt', 'Anny', 'Anniy', 'Maxett'],
    'last_name': ['Jacobson', 'Jacobson', 'Cooze', 'Cooze', 'Milner', 'Milner', 'Milner', 'Milner', 'Milner', 'Milner'],
    'gender': ['X', 'X', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female']}
df = pandas.DataFrame(raw_data, columns=['first_name', 'last_name', 'gender'])
last_name_dict = dict(iter(df.groupby("last_name")))
def gender_features(word):
    return {'first_letter': word[0], 'last_letter': word[-1]}
dct = {}
for last in df.last_name.unique():
    dct[last] = []
    for first, gender in zip(last_name_dict[last].first_name, last_name_dict[last].gender):
        dct[last].append((gender_features(first), gender))
# Train one classifier per last name in a loop
traindct = {}
class_dct = {}
for last in df.last_name.unique():
    traindct['train_set_%s' % last] = dct[last][0:]
    class_dct['Classif_%s' % last] = nltk.NaiveBayesClassifier.train(traindct['train_set_%s' % last])
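# Save all classifiers to a single pickle file
with open('classifya.pickle', 'wb') as f:
    pickle.dump(class_dct, f)
# Load them back; the with block closes the file even if an error occurs
with open('classifya.pickle', 'rb') as f:
    class_dct = pickle.load(f)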
new_app = []
for last in df.last_name.unique():
    for first, gender in zip(last_name_dict[last].first_name, last_name_dict[last].gender):
        new_app.append((first, last, gender, class_dct['Classif_%s' % last].classify(gender_features(first))))
df_output = pandas.DataFrame(new_app, columns=['First Name', 'Last Name', 'Gender', 'New Gender'])
Any suggestions for a better or faster approach would be much appreciated.
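For the pickle portability concern, here is a minimal sketch of one possible approach: store version information next to the classifiers and compare it when loading. The helper names `save_bundle` and `load_bundle`, the bundle layout, and the placeholder version string are assumptions made for illustration, not part of the code above; a plain dict stands in for `class_dct` so the sketch runs without nltk.

```python
import pickle
import sys
import tempfile
from pathlib import Path


def save_bundle(models, path, extra_versions=None):
    """Pickle the models together with version metadata (hypothetical helper)."""
    bundle = {
        "python": sys.version_info[:2],
        # In real use this could be e.g. {"nltk": nltk.__version__}
        "versions": dict(extra_versions or {}),
        "models": models,
    }
    with open(path, "wb") as f:
        # Protocol 4 can be read by Python 3.4 and later
        pickle.dump(bundle, f, protocol=4)


def load_bundle(path, expected_versions=None):
    """Load the bundle and return (models, list of version mismatches)."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    mismatches = []
    current_python = sys.version_info[:2]
    if tuple(bundle["python"]) != current_python:
        mismatches.append(("python", tuple(bundle["python"]), current_python))
    for name, expected in (expected_versions or {}).items():
        saved = bundle["versions"].get(name)
        if saved != expected:
            mismatches.append((name, saved, expected))
    return bundle["models"], mismatches


# Stand-in for class_dct; the version string is a placeholder, not a real release
demo_models = {"Classif_Milner": {"first_letter": "A"}}
demo_path = Path(tempfile.mkdtemp()) / "classifya.pickle"
save_bundle(demo_models, demo_path, {"nltk": "placeholder-version"})
loaded_models, problems = load_bundle(demo_path, {"nltk": "placeholder-version"})
```

An empty `problems` list means the loading environment matches what was recorded at save time. Two points hold regardless of this sketch: the receiving machine needs nltk installed, because pickle stores a reference to the classifier class rather than its code, and a pickle file should only be loaded when it comes from a trusted source.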