为什么我和所有人的预测值相同?

时间:2017-09-08 14:40:34

标签: python scikit-learn recommendation-engine

我正在尝试使用sci-kit构建决策树。但是我得到了与所有价值的预测相同的价值。

le = preprocessing.LabelEncoder()
    def labelEncoder(df, col_name):
       df[[col_name]] = le.fit_transform(df[[col_name]])
    labelEncoder(dfr, "Gender")
    labelEncoder(dfr, "Subscription Tenure Type")
    labelEncoder(dfr, "Located Region")
    labelEncoder(dfr, "Attrition")
    labelEncoder(dfr, "Type of subscription")
    labelEncoder(dfr, "Genre")
    # # Splitiing the data to test and train
    feature = dfr[["Gender", "Age", "Subscription year", "Subscription Tenure Type", "Type of subscription",
          "Located Region", "Average Hours of watching(Weekly)", "Attrition",
          "Web channle utilization", "Mobile Channel Utilization"]]
    labels = dfr[["Genre"]]
clf_gini = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                     max_depth=3, min_samples_leaf=9 ,min_samples_split=2, splitter='random')

clf_gini.fit(feature_train, labels_train)
y_pred = clf_gini.predict(feature_test)

print(list((y_pred)))

以下是样本数据。

User Id Genre   Rating  Gender  Age Subscription year   Subscription Tenure Type    Type of subscription    Located Region  Average Hours of watching(Weekly)   Attrition   Web channle utilization Mobile Channel Utilization
1   Romance 4   Female  51  2000    Annual  Individual  R3  7   Yes 89  11
2   Action  4.769230769 Female  42  2004    6 Months    Individual  R6  13  No  88  12
2   Adventure   4.909090909 Female  42  2004    6 Months    Individual  R6  13  No  88  12
2   Comedy  4.2 Female  42  2004    6 Months    Individual  R6  13  No  88  12
2   Crime   5   Female  42  2004    6 Months    Individual  R6  13  No  88  12
2   Drama   4.2 Female  42  2004    6 Months    Individual  R6  13  No  88  12

2 个答案:

答案 0 :(得分:3)

您提供的代码段存在一些问题。

  • 您预测使用svm代替clf_gini;
  • 缺少实际将数据集拆分为火车和测试的代码;
  • 您是否将相同的转换应用于训练和测试集?

答案 1 :(得分:-1)

您正在呼叫svm而不是clf_gini。如果这不能解答您的问题,请您提供更多详细信息?

以下示例代码有效:

import pandas as pd

arr = [[1  , 'Romance', 4,   'Female',  51,  2000,    'Annual' , 'Individual' , 'R3',  7,   'Yes', 89,  11],
[2  , 'Action' , 4.7, 'Female',  42,  2004,    '6 Months' ,   'Individual',  'R6',  13,  'No',  88,  12],
[2  , 'Adventure',   4.9, 'Female',  42,  2004,    '6 Months',    'Individual',  'R6',  13,  'No',  88,  12],
[2  , 'Comedy' , 4.2, 'Female',  42 , 2004,    '6 Months' ,   'Individual',  'R6'  ,13,  'No',  88,  12],
[2  , 'Crime'  , 5  , 'Female',  42 , 2004,    '6 Months' ,   'Individual',  'R6' , 13,  'No',  88,  12],
[2  , 'Drama'  , 4.2, 'Female',  42,  2004,    '6 Months' ,   'Individual',  'R6',  13,  'No',  88,  12]]

headers = ['User Id', 'Genre',   'Rating',  'Gender',  'Age', 'Subscription year',   'Subscription Tenure Type', 'Type of subscription',  'Located Region',  'Average Hours of watching(Weekly)',   'Attrition',   'Web channle utilization', 'Mobile Channel Utilization']

dfr = pd.DataFrame(arr, columns = headers )

import sklearn
le = sklearn.preprocessing.LabelEncoder()
def labelEncoder(df, col_name):
    df[[col_name]] = le.fit_transform(df[[col_name]])
labelEncoder(dfr, "Gender")
labelEncoder(dfr, "Subscription Tenure Type")
labelEncoder(dfr, "Located Region")
labelEncoder(dfr, "Attrition")
labelEncoder(dfr, "Type of subscription")
labelEncoder(dfr, "Genre")

# # Splitiing the data to test and train

feature = dfr[["Gender", "Age", "Subscription year", "Subscription Tenure Type", "Type of subscription",
  "Located Region", "Average Hours of watching(Weekly)", "Attrition",
  "Web channle utilization", "Mobile Channel Utilization"]]

clf_gini = DecisionTreeClassifier(criterion="entropy", random_state=100,
                                 max_depth=3, min_samples_leaf=9 ,min_samples_split=2, splitter='random')

# create test / train split
dfr_train = dfr.iloc[:-1]
dfr_test = dfr.iloc[-1]
y_train = dfr_train['Genre']
y_test = dfr_test['Genre']

del dfr_train['Genre']
del dfr_test['Genre']


clf_gini.fit(dfr_train, y_train)
y_pred = clf_gini.predict(dfr_test)

print(list((y_pred)))