Cannot get the correct predictions after using BernoulliNB in Python with sklearn

Asked: 2018-04-26 19:36:58

Tags: python pandas scikit-learn naivebayes multilabel-classification

I collected some code online for my research work and for practice. I am working on a Denver crime dataset. It looks like this:

INCIDENT_ID               446399 non-null int64
OFFENSE_ID                446399 non-null int64
OFFENSE_CODE              446399 non-null int64
OFFENSE_CODE_EXTENSION    446399 non-null int64
OFFENSE_TYPE_ID           446399 non-null object
OFFENSE_CATEGORY_ID       446399 non-null object
FIRST_OCCURRENCE_DATE     446399 non-null object
LAST_OCCURRENCE_DATE      149714 non-null object
REPORTED_DATE             446399 non-null object
INCIDENT_ADDRESS          400668 non-null object
GEO_X                     442927 non-null float64
GEO_Y                     442927 non-null float64
GEO_LON                   442927 non-null float64
GEO_LAT                   442927 non-null float64
DISTRICT_ID               446399 non-null int64
PRECINCT_ID               446399 non-null int64
NEIGHBORHOOD_ID           446399 non-null object
IS_CRIME                  446399 non-null int64
IS_TRAFFIC                446399 non-null int64
dtypes: float64(4), int64(8), object(7)

[Image: first records of crime.csv]

I applied this code to it:

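    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB

    def normalize(data):
        # Feature normalization: center on the mean, scale by the value range
        data = (data - data.mean()) / (data.max() - data.min())
        return data

    # Month number -> short name, used as dummy column names
    num2month = {1: 'jan', 2: 'feb', 3: 'mar', 4: 'apr', 5: 'may', 6: 'jun',
                 7: 'jul', 8: 'aug', 9: 'sep', 10: 'oct', 11: 'nov', 12: 'dec'}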

    crime = pd.read_csv('crime.csv')
    train, test = train_test_split(crime, test_size=0.2)
    test.to_csv('test.csv')
    train.to_csv('train.csv')
    train = pd.read_csv('train.csv', parse_dates=['FIRST_OCCURRENCE_DATE'])
    test = pd.read_csv('test.csv', parse_dates=['FIRST_OCCURRENCE_DATE'])

    # For the training data: encode the target categories as integers
    le_crime = preprocessing.LabelEncoder()
    crime = le_crime.fit_transform(train.OFFENSE_CATEGORY_ID)

    train['FIRST_OCCURRENCE_DATE'] = pd.to_datetime(train['FIRST_OCCURRENCE_DATE'])
    # .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same names
    train['FIRST_OCCURRENCE_DATE(DAYOFWEEK)'] = train['FIRST_OCCURRENCE_DATE'].dt.day_name()
    train['FIRST_OCCURRENCE_DATE(YEAR)'] = train['FIRST_OCCURRENCE_DATE'].dt.year
    train['FIRST_OCCURRENCE_DATE(MONTH)'] = train['FIRST_OCCURRENCE_DATE'].dt.month
    train['FIRST_OCCURRENCE_DATE(DAY)'] = train['FIRST_OCCURRENCE_DATE'].dt.day
    train['Year'] = train['FIRST_OCCURRENCE_DATE'].dt.year
    # Note: this copies the target column; DISTRICT_ID may have been intended here
    train['PdDistrict'] = train['OFFENSE_CATEGORY_ID']

    # Get binarized weekdays, districts, and hours.
    train['Days'] = train['FIRST_OCCURRENCE_DATE(DAYOFWEEK)']
    days = pd.get_dummies(train.Days)
    district = pd.get_dummies(train.PdDistrict)
    month = pd.get_dummies(train.FIRST_OCCURRENCE_DATE.dt.month.map(num2month))
    hour = train.FIRST_OCCURRENCE_DATE.dt.hour
    submit = pd.read_csv('submit.csv')

    # Build new array
    new_datatr = pd.concat([hour, month, days, district], axis=1)
    new_datatr['X'] = normalize(train.GEO_LON)
    new_datatr['Y'] = normalize(train.GEO_LAT)
    new_datatr['hour'] = normalize(train.FIRST_OCCURRENCE_DATE.dt.hour)

    new_datatr['crime'] = crime

    # 1 for hours from 18:00 through 05:59, otherwise 0
    new_datatr['dark'] = train.FIRST_OCCURRENCE_DATE.dt.hour.apply(lambda x: 1 if (x >= 18 or x < 6) else 0)

    train_proc = new_datatr



    # ...and the same processing is repeated for the test data set, ending with:
    test_proc = new_datatr

    features = [1, 2,
                'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec',
                'Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday',
                # 'X', 'Y'
                ]

    # Hold out part of the training data and compute the validation log loss
    training, validation = train_test_split(train_proc, train_size=.67)
    model = BernoulliNB()
    model.fit(training[features], training['crime'])
    predicted = np.array(model.predict_proba(validation[features]))
    log_loss(validation['crime'], predicted)


    # Refit on the full training set and predict class probabilities for the test set
    model = BernoulliNB()
    model.fit(train_proc[features], train_proc['crime'])
    predicted = model.predict_proba(test_proc[features])

    # One column per offense category, one row per test record
    le_crime = preprocessing.LabelEncoder()
    crime = le_crime.fit_transform(train.OFFENSE_CATEGORY_ID)
    result = pd.DataFrame(predicted, columns=le_crime.classes_)
    result.to_csv('submit.csv', index=True, index_label='Id')
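
Since the test set is said to go through the same steps as the training set, one option is to put the date-derived features into a single function and call it on both frames. The sketch below is only an illustration under assumptions: `build_features`, `MONTHS`, `DAYS` and the two demo frames are hypothetical names that do not appear in the code above, and only the month, weekday, hour and `dark` columns are rebuilt. Reindexing to a fixed column list keeps the train and test frames aligned even when a month or weekday is missing from one of them.

```python
import pandas as pd

# Fixed column orders so every frame ends up with identical feature columns.
MONTHS = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']


def build_features(df):
    """Return one-hot month and weekday columns plus 'hour' and 'dark'.

    Hypothetical helper: expects a FIRST_OCCURRENCE_DATE column.
    """
    dates = pd.to_datetime(df['FIRST_OCCURRENCE_DATE'])

    # Map month numbers (1..12) to short names, then one-hot encode.
    num2month = dict(enumerate(MONTHS, start=1))
    month = pd.get_dummies(dates.dt.month.map(num2month), dtype=int)
    month = month.reindex(columns=MONTHS, fill_value=0)

    # One-hot encode weekday names; absent weekdays become all-zero columns.
    days = pd.get_dummies(dates.dt.day_name(), dtype=int)
    days = days.reindex(columns=DAYS, fill_value=0)

    out = pd.concat([month, days], axis=1)
    out['hour'] = dates.dt.hour
    # 1 for hours from 18:00 through 05:59, otherwise 0.
    out['dark'] = ((dates.dt.hour >= 18) | (dates.dt.hour < 6)).astype(int)
    return out


# Made-up rows standing in for train.csv and test.csv.
train_demo = pd.DataFrame({'FIRST_OCCURRENCE_DATE': ['2018-04-26 19:36:58',
                                                     '2018-01-01 07:00:00']})
test_demo = pd.DataFrame({'FIRST_OCCURRENCE_DATE': ['2018-07-04 23:15:00']})

train_feats = build_features(train_demo)
test_feats = build_features(test_demo)
print(list(train_feats.columns) == list(test_feats.columns))
```

With a helper like this, `train_proc` and `test_proc` would each come from their own frame rather than from the same `new_datatr` object.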

Finally, when I open the submission file, I find class membership probabilities (percentages) for every instance. It looks like this:

[Image: contents of submit.csv]
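
A table like this holds one probability per class in each row, so the most likely class for a record is the column with the largest value in that row. The following is a small sketch with made-up numbers and illustrative category names, not the actual contents of `submit.csv`:

```python
import pandas as pd

# Hypothetical excerpt of a probability table; names and values are invented.
proba = pd.DataFrame(
    {
        'burglary': [0.10, 0.70, 0.25],
        'larceny': [0.60, 0.20, 0.25],
        'traffic-accident': [0.30, 0.10, 0.50],
    }
)
proba.index.name = 'Id'

# Each row is a probability distribution over the classes, so it sums to 1.
row_sums = proba.sum(axis=1)

# The most likely class per row is the column that holds the row maximum.
top_class = proba.idxmax(axis=1)
top_prob = proba.max(axis=1)

summary = pd.DataFrame({'predicted': top_class, 'probability': top_prob})
print(summary)
```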

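I would like to get a file that predicts the exact category (the offense id) for each record, rather than the class membership probabilities.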
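
For reference, scikit-learn classifiers expose `predict()` next to `predict_proba()`: the former returns a single class per row. Because the target was integer-encoded with `LabelEncoder`, `inverse_transform()` can map the predicted codes back to the category names. Below is a minimal sketch on toy data; the two features and the category names are assumptions for illustration, not the real dataset.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins for the binary features and the OFFENSE_CATEGORY_ID labels.
X_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_names = ['burglary', 'burglary', 'larceny', 'larceny']

# Encode the string labels as integers, as in the question's code.
le_crime = preprocessing.LabelEncoder()
y_train = le_crime.fit_transform(y_names)

model = BernoulliNB()
model.fit(X_train, y_train)

X_new = np.array([[1, 0], [0, 1]])

# predict() returns one encoded class per row instead of one probability per class.
encoded = model.predict(X_new)

# inverse_transform() maps the integer codes back to the original category names.
labels = le_crime.inverse_transform(encoded)
print(labels)
```

Applied to the code in the question, this would presumably mean calling `model.predict(test_proc[features])`, passing the result through `le_crime.inverse_transform(...)`, and writing that single column to `submit.csv` instead of the probability matrix.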

0 answers:

No answers