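I have a dataset in which every column is categorical. From it, I need to find the proportion of the target class, i.e. the share of 1s, within each level of each categorical variable. I then dummy-encode each categorical variable and attach each level's correlation with the target class. Below is an example of the input data and the expected output: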
# Input data:
import pandas as pd

df_data = pd.DataFrame({
    'production': ['1101100000', '1101100000', '100100000', '100100000', '1101100000',
                   '1101100000', '1001000000', '1101100000', '1101100000', '1101100000'],
    'enc_svod': ['Free', 'Free', 'Pay', '', 'Pay', 'Free', 'Free', '', '', 'Pay'],
    'status': [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})
cat_cols = ['production', 'enc_svod']
# Code to find proportions and correlation with target_class:
# Now traverse through each column and calculate correlation and generate metrics
cat_count = 0
cat_metrics_df = pd.DataFrame()
for each_col in cat_cols:
    df_temp = pd.DataFrame()
    df_single_col_data = df_data[[each_col]]
    cat_count += 1
    # Calculate uniques and nulls in each column to display in log file.
    uniques_in_column = len(df_single_col_data[each_col].unique())
    nulls_in_column = df_single_col_data.isnull().sum()
    print('Working on column %s, converting to dummies and finding correlation with target' % (each_col))
    df_categorical_attribute = pd.get_dummies(df_single_col_data[each_col].astype(str), dummy_na=True, prefix=each_col)
    df_categorical_attribute = df_categorical_attribute.loc[:, df_categorical_attribute.var() != 0.0]  # Drop columns with 0 variance.
    df_temp['correlation'] = df_categorical_attribute.corrwith(df_data['status'])
    try:
        # Calculate Index: proportion of 1's within each CAT level
        frames = [df_single_col_data, df_data['status']]
        df_proportions = pd.concat(frames, axis=1)
        df_proportions = df_proportions.fillna('nan').groupby(each_col, as_index=True).mean()
        df_proportions.index = [str(df_proportions.index.name) + '_' + str(x) for x in df_proportions.index.values]
        df_temp['Index'] = df_temp.join(df_proportions)['status']
        df_temp['Attribute'] = str(each_col)
        cat_metrics_df = cat_metrics_df.append(df_temp)
    except ValueError:
        print("Error for column %s:" % (each_col))
        continue
The reason I used try/except here is that some variables raise a ValueError, as shown below:
Traceback (most recent call last):
File "/user/data_processing_functions.py", line 443, in metrics_categorical
df_temp['Index'] = df_temp.join(df_proportions)['disco_status']
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2397, in _set_item
value = self._sanitize_column(key, value)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2547, in _sanitize_column
value = reindexer(value)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2539, in reindexer
raise e
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2534, in reindexer
value = value.reindex(self.index)._values
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2426, in reindex
return super(Series, self).reindex(index=index, **kwargs)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2515, in reindex
fill_value, copy).__finalize__(self)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2533, in _reindex_axes
copy=copy, allow_dups=False)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2627, in _reindex_with_indexers
copy=copy)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3886, in reindex_indexer
self.axes[axis]._can_reindex(indexer)
File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2836, in _can_reindex
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
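The last frames of the traceback can be reproduced in isolation: assigning a Series whose index contains duplicate labels to a DataFrame column forces a reindex against the frame's index, and that reindex fails. This is only a minimal sketch with made-up names (`target`, `values`) and made-up numbers, not taken from my data, and the exact wording of the message differs between pandas versions:

```python
import pandas as pd

# Hypothetical stand-in for df_temp: one row per dummy column.
target = pd.DataFrame({'correlation': [0.1, 0.2]}, index=['a', 'b'])

# Hypothetical stand-in for the joined result: label 'a' appears twice.
values = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])

raised = False
try:
    # The indexes differ, so pandas must reindex `values` onto target.index,
    # which is not possible when `values` has duplicate labels.
    target['Index'] = values
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```

If this is what happens in my function, then the result of the join must contain a repeated label for the affected columns.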
For some columns, the number of unique values (19) is greater than the number of categories remaining after this step:
df_categorical_attribute = df_categorical_attribute.loc[:, df_categorical_attribute.var() != 0.0]# Drop columns with 0 variance.
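This happens when I run it on a server with pandas version 0.20.3, whereas my local install is the latest, 0.23.4. I am not sure whether that is the cause of the error or whether it is something else. I added the try/except so that a column is skipped when it raises a ValueError. I am not sure why this happens; my guess is that it is caused by whitespace somewhere in the full data, 2.5 million rows * 1200 columns (locally I used a sample of 50000), so the sample may not have captured those cases.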
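One way I could test that guess is to look for distinct raw values that end up with the same label once they are converted to strings, since the labels built from the group keys would then no longer be unique. The following is only a sketch under that assumption: `label_collisions` is a hypothetical helper that is not part of my code, and the `strip` option is there to check the whitespace idea specifically.

```python
import pandas as pd

def label_collisions(series, strip=False):
    """Return string labels shared by more than one distinct raw value."""
    # Distinct non-null raw values, e.g. the integer 1 and the string '1'.
    raw = pd.Series(series.dropna().unique())
    labels = raw.map(str)
    if strip:
        # Treat values that differ only by surrounding whitespace as equal.
        labels = labels.str.strip()
    counts = labels.value_counts()
    return sorted(counts[counts > 1].index)

# Made-up examples, not taken from the real data.
mixed_types = pd.Series([1, '1', 'Free', None], dtype=object)
padded = pd.Series(['Pay', 'Pay ', 'Free'])

print(label_collisions(mixed_types))
print(label_collisions(padded))
print(label_collisions(padded, strip=True))
```

Running this per column on the full data, for example `label_collisions(df_data[each_col])` inside the loop, would show whether the failing columns are the ones with colliding labels.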