Random Forest score is 100% after feature selection

Asked: 2019-05-14 15:36:42

Tags: python machine-learning scikit-learn

I've been working with a dataset and have followed this process to predict customer churn:

1) Encoded and standardised the data
2) Ran a random forest
3) Got a model score of 0.63
4) Analysed the feature importances
5) Re-ran the model on the reduced feature set
6) Got a model score of 1.0 for the revised model

I'm not sure why the score has suddenly jumped to 100%. I've checked the feature importances again and no single feature contributes 100% to the prediction. I've also made sure I'm using the train_test_split function, so there shouldn't be any leakage between the test and training data.

If anyone could help me out that would be amazing, as I'm really stuck!

#!/usr/bin/env python
# coding: utf-8

# # The Scenario

# From https://www.kaggle.com/abhinav89/telecom-customer/version/1.
# 
# This data set consists of 100 variables and approx 100 thousand records. This data set contains different variables explaining the attributes of telecom industry and various factors considered important while dealing with customers of telecom industry. The target variable here is churn which explains whether the customer will churn or not. We can use this data set to predict the customers who would churn or who wouldn't churn depending on various variables available.

# # Import data

# In[1]:


import pandas as pd
path = "churn.csv"
df = pd.read_csv(path, delimiter=',', header='infer')
df.head()


# # Generate the X (features) and y (target) dataframes

# In[2]:


x = df[[
 'rev_Mean',
 'mou_Mean',
 'totmrc_Mean',
 'da_Mean',
 'ovrmou_Mean',
 'ovrrev_Mean',
 'vceovr_Mean',
 'datovr_Mean',
 'roam_Mean',
 'change_mou',
 'change_rev',
 'drop_vce_Mean',
 'drop_dat_Mean',
 'blck_vce_Mean',
 'blck_dat_Mean',
 'unan_vce_Mean',
 'unan_dat_Mean',
 'plcd_vce_Mean',
 'plcd_dat_Mean',
 'recv_vce_Mean',
 'recv_sms_Mean',
 'comp_vce_Mean',
 'comp_dat_Mean',
 'custcare_Mean',
 'ccrndmou_Mean',
 'cc_mou_Mean',
 'inonemin_Mean',
 'threeway_Mean',
 'mou_cvce_Mean',
 'mou_cdat_Mean',
 'mou_rvce_Mean',
 'owylis_vce_Mean',
 'mouowylisv_Mean',
 'iwylis_vce_Mean',
 'mouiwylisv_Mean',
 'peak_vce_Mean',
 'peak_dat_Mean',
 'mou_peav_Mean',
 'mou_pead_Mean',
 'opk_vce_Mean',
 'opk_dat_Mean',
 'mou_opkv_Mean',
 'mou_opkd_Mean',
 'drop_blk_Mean',
 'attempt_Mean',
 'complete_Mean',
 'callfwdv_Mean',
 'callwait_Mean',
 'months',
 'uniqsubs',
 'actvsubs',
 'new_cell',
 'crclscod',
 'asl_flag',
 'totcalls',
 'totmou',
 'totrev',
 'adjrev',
 'adjmou',
 'adjqty',
 'avgrev',
 'avgmou',
 'avgqty',
 'avg3mou',
 'avg3qty',
 'avg3rev',
 'avg6mou',
 'avg6qty',
 'avg6rev',
 'prizm_social_one',
 'area',
 'dualband',
 'refurb_new',
 'hnd_price',
 'phones',
 'models',
 'hnd_webcap',
 'truck',
 'rv',
 'ownrent',
 'lor',
 'dwlltype',
 'marital',
 'adults',
 'infobase',
 'income',
 'numbcars',
 'HHstatin',
 'dwllsize',
 'forgntvl',
 'ethnic',
 'kid0_2',
 'kid3_5',
 'kid6_10',
 'kid11_15',
 'kid16_17',
 'creditcd',
 'eqpdays',
 'Customer_ID'
       ]]


y =  df[['churn']]

#check columns in new df
list(x)


# In[3]:


#show unique values in the dataframe column
df.churn.unique()


# # Standardize & encode data
# 
# When we’re getting our data ready for our machine learning models, it’s important to consider scaling and encoding.
# 
# Scaling is a method used to standardise the range of data. This is important because if one field stores age (between 18 and 90) and another stores salary (between 10,000 and 200,000), a machine learning algorithm might bias its results towards the larger numbers, as it may assume they're more important. The scikit-learn documentation states that “if a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.”
# 
# Using scikit-learn's StandardScaler, we can transform each feature to have a mean of zero and a standard deviation of 1, removing this potential bias from the model.
# 
# For some models this is an absolute requirement, as certain algorithms expect the data to be normally distributed and centred around zero.
# 
# Encoding is simple: machine learning algorithms can only accept numerical features. If an input variable takes the values Male and Female, we can encode them as 0 or 1 so they can be used in the machine learning model.
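
# For illustration only (not part of the original notebook): a toy example of the Male/Female encoding described above, using pandas get_dummies (pandas is imported as pd in the first cell).

# Toy example (illustrative only): get_dummies turns a text category into 0/1 indicator columns
demo = pd.DataFrame({'gender': ['Male', 'Female', 'Female']})
print(pd.get_dummies(demo))  # produces gender_Female and gender_Male columns of 0s and 1s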

# In[4]:


from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np

#encoding with get_dummies
x = pd.get_dummies( x )

#fill in NA values with zeros
x = x.fillna(0)

#standardize the scale
x = StandardScaler().fit_transform(x)

#convert dataframes to numpy arrays
x = np.array(x)
y = np.array(y)


# # Split data (75% training & 25% testing)

# In[5]:


from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)


# # Train the model (fit) on the training data


# In[15]:


from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
model = RandomForestClassifier(n_estimators = 1000, random_state = 42)

model.fit(train_features, train_labels.ravel())


# In[16]:


predictions = model.predict(test_features)


# In[17]:


model.score(train_features, train_labels)


# In[18]:


model.score(test_features, test_labels)


# # Can we remove some features?
#  - Reduces Overfitting
#  - Improves Accuracy
#  - Reduces Training Time

# In[19]:


importance = model.feature_importances_
importances = pd.DataFrame(importance)

dictionary = dict(zip(df.columns, model.feature_importances_))


# In[20]:


feature_matrix = pd.DataFrame(dictionary, index=[0])
featurex = feature_matrix.T
featurex.columns = ['meas']


# In[21]:


#Check the score for every column in the DF
sorted_features = featurex.sort_values(by=['meas'], ascending=False)
with pd.option_context("display.max_rows", 10000):
    print(sorted_features)
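
# As an aside (not part of the original notebook): instead of hand-picking columns from the sorted importances, scikit-learn's SelectFromModel can apply an importance threshold automatically. A minimal sketch, assuming `model` is the forest fitted above and `train_features` / `test_features` come from the earlier split:

# Sketch only: keep features whose importance is above the median importance
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(model, threshold="median", prefit=True)
train_reduced = selector.transform(train_features)
test_reduced = selector.transform(test_features)
print(train_reduced.shape)  # fewer columns than train_features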


# In[22]:


#create a new DF with only scores above a certain threshold
df_limited = df[['models',
'change_mou',
'hnd_webcap',
'churn',
'mou_Mean',
'change_rev',
'asl_flag',
'crclscod',
'adjmou',
'totrev',
'adjrev',
'rev_Mean',
'actvsubs',
'totmou',
'new_cell',
'totcalls',
'adjqty',
'mou_cvce_Mean',
'avgrev',
'avgqty',
'mou_opkv_Mean',
'mou_peav_Mean',
'avg3mou',
'mouowylisv_Mean',
'totmrc_Mean',
'mou_rvce_Mean',
'peak_vce_Mean',
'opk_vce_Mean',
'unan_vce_Mean',
'avg3qty',
'avgmou',
'recv_vce_Mean',
'owylis_vce_Mean',
'plcd_vce_Mean',
'attempt_Mean',
'complete_Mean',
'comp_vce_Mean',
'inonemin_Mean',
'drop_blk_Mean',
'mouiwylisv_Mean',
'drop_vce_Mean',
'ovrrev_Mean',
'ovrmou_Mean',
'iwylis_vce_Mean',
'blck_vce_Mean',
'avg3rev',
'vceovr_Mean',
'area']]


# In[23]:


#encoding with get_dummies
x2 = pd.get_dummies( df_limited )

#fill in NA values with zeros
x2 = x2.fillna(0)

#standardize the scale
x2 = StandardScaler().fit_transform(x2)

#convert dataframes to numpy arrays
x2 = np.array(x2)


# In[24]:


from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(x2, y, test_size = 0.25, random_state = 42)


# In[25]:


model = RandomForestClassifier(n_estimators = 1000, random_state = 42)
model.fit(train_features, train_labels.ravel())


# In[26]:


predictions = model.predict(test_features)


# In[27]:


model.score(train_features, train_labels)


# In[28]:


model.score(test_features, test_labels)

2 Answers:

Answer 0 (score: 1):

You need to drop churn from your training set. Because you've left it in, and it's exactly what you're trying to predict, your data is leaking. Before splitting into train and test, do the following:

x2.drop(columns=['churn'], inplace=True)
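
Since x2 in the notebook has already been converted to a NumPy array by the time it is split, the drop is easiest to apply on the DataFrame before encoding and scaling. A minimal sketch of that corrected flow (assuming df_limited, y, and the imports from the question):

features = df_limited.drop(columns=['churn'])      # remove the target from the features
x2 = pd.get_dummies(features)                      # one-hot encode the categorical columns
x2 = x2.fillna(0)                                  # fill missing values with zeros
x2 = StandardScaler().fit_transform(x2)            # standardise the scale
train_features, test_features, train_labels, test_labels = train_test_split(
    x2, y, test_size=0.25, random_state=42)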

Please accept the answer if it works for you.

Answer 1 (score: 1):

Your x2 variable, i.e. the second training dataset, contains churn. The model is essentially memorising the outcome itself in order to predict the outcome.

The reason none of your features shows 100% feature importance is that you are one-hot encoding the dataset, so the churn variable gets split across multiple columns.

x2.drop('churn', axis=1, inplace=True)

This should solve your problem.
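
As a quick sanity check (assuming the drop is applied while x2 is still a DataFrame, i.e. before StandardScaler), you can confirm the target is gone before re-fitting:

assert 'churn' not in x2.columns  # raises if the target column is still in the feature set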