Code to find proportions and correlation with target_class:

Question

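I have a dataset in which every column is categorical. From it I need to find the proportion of target_class, i.e. the share of 1s within each level of each categorical variable. I then want to attach to each level its correlation with target_class, computed by dummy-encoding the categorical variable. Here is an example of the input data and the expected output: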

# Input data:

import pandas as pd

df_data = pd.DataFrame({
    'production': ['1101100000', '1101100000', '100100000', '100100000', '1101100000',
                   '1101100000', '1001000000', '1101100000', '1101100000', '1101100000'],
    'enc_svod': ['Free', 'Free', 'Pay', '', 'Pay', 'Free', 'Free', '', '', 'Pay'],
    'status': [1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

Code to find proportions and correlation with target_class:

cat_cols = ['production','enc_svod']

# Code to find proportions and correlation with target_class:
# traverse each column, calculate the correlation and generate metrics.
cat_count = 0
cat_metrics_df = pd.DataFrame()


for each_col in cat_cols:
    df_temp = pd.DataFrame()
    df_single_col_data = df_data[[each_col]]
    cat_count += 1

    # Count the unique values and nulls in each column, for display in the log file.
    uniques_in_column = len(df_single_col_data[each_col].unique())
    nulls_in_column = df_single_col_data[each_col].isnull().sum()


    print('Working on column %s, converting to dummies and finding correlation with target' %(each_col))
    df_categorical_attribute = pd.get_dummies(df_single_col_data[each_col].astype(str), dummy_na=True, prefix=each_col)
    df_categorical_attribute = df_categorical_attribute.loc[:, df_categorical_attribute.var() != 0.0]# Drop columns with 0 variance.

    df_temp['correlation'] = df_categorical_attribute.corrwith(df_data['status'])

    try:
        # Index: proportion of 1s within each level of the categorical column.
        frames = [df_single_col_data, df_data['status']]
        df_proportions = pd.concat(frames, axis=1)
        df_proportions = df_proportions.fillna('nan').groupby(each_col, as_index=True).mean()
        df_proportions.index = [str(df_proportions.index.name) + '_' + str(x) for x in df_proportions.index.values]

        df_temp['Index'] = df_temp.join(df_proportions)['status']
        df_temp['Attribute'] = str(each_col)
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
        cat_metrics_df = pd.concat([cat_metrics_df, df_temp])
    except ValueError:
        print("Error for column %s:" % (each_col))
        continue

The reason I wrapped this in try/except is that some variables raise a ValueError, as shown below:

Traceback (most recent call last):
  File "/user/data_processing_functions.py", line 443, in metrics_categorical
    df_temp['Index'] = df_temp.join(df_proportions)['disco_status']
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2331, in __setitem__
    self._set_item(key, value)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2397, in _set_item
    value = self._sanitize_column(key, value)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2547, in _sanitize_column
    value = reindexer(value)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2539, in reindexer
    raise e
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2534, in reindexer
    value = value.reindex(self.index)._values
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2426, in reindex
    return super(Series, self).reindex(index=index, **kwargs)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2515, in reindex
    fill_value, copy).__finalize__(self)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2533, in _reindex_axes
    copy=copy, allow_dups=False)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2627, in _reindex_with_indexers
    copy=copy)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3886, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/user/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2836, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
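
For reference, here is a minimal sketch of one way the assignment on that line can raise this error. It is not taken from the real data: the frames, the labels `cat_1` and `cat_a`, and the numbers are all made up, and it assumes the right-hand side of the join ends up with a repeated label. Whether that is what happens on the server is not confirmed.

```python
import pandas as pd

# Left side: one row per dummy column, like df_temp in the code above.
# All labels and values here are invented for illustration.
df_temp = pd.DataFrame({'correlation': [0.5, -0.5]},
                       index=['cat_1', 'cat_a'])

# Right side: proportions where two levels ended up with the same label.
df_proportions = pd.DataFrame({'status': [1.0, 0.0, 0.5]},
                              index=['cat_1', 'cat_1', 'cat_a'])

# The left join repeats the 'cat_1' row, so the result has three rows.
joined = df_temp.join(df_proportions)

# Assigning a Series with duplicate labels forces a reindex, which fails.
try:
    df_temp['Index'] = joined['status']
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Labels that occur more than once on the right-hand side.
duplicated_labels = df_proportions.index[df_proportions.index.duplicated()].tolist()
print(error_message)
print(duplicated_labels)
```

Printing the duplicated labels for a failing column, instead of skipping the column, would show whether this is the situation in the full data.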

For some columns, the number of unique values (19) is greater than the number of categories remaining after this step:

df_categorical_attribute = df_categorical_attribute.loc[:, df_categorical_attribute.var() != 0.0]# Drop columns with 0 variance.

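This happens when I run it on a server with pandas 0.20.3, whereas my local version is the latest, 0.23.4. I'm not sure whether the version difference is the cause of the error or whether it is something else. I added the try/except so that a column is skipped when a ValueError occurs. I don't know why this happens; my guess is that it is due to whitespace somewhere in the full data, which is 2.5 million rows × 1200 columns. Locally I used a sample of 50,000 rows, which may not capture those cases.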
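
If the whitespace guess is right (or if a column mixes types, such as the integer 1 and the string '1'), distinct raw values can collapse to the same label once they are converted with str(). A possible workaround, sketched below under that assumption, is to normalise the key once and group on the normalised key, so the proportion labels are unique by construction. The helper name `proportions_by_level` and the sample frame are hypothetical and not part of the original code.

```python
import pandas as pd

def proportions_by_level(df, col, target):
    """Hypothetical helper: mean of `target` per normalised level of `col`."""
    # Normalise every value to a stripped string; missing values become 'nan'.
    key = df[col].map(lambda v: 'nan' if pd.isna(v) else str(v).strip())
    props = df[target].groupby(key).mean()
    # Prefix the labels with the column name, as in the question's code.
    props.index = [col + '_' + level for level in props.index]
    return props

# Made-up sample: 1 vs '1' and ' a' vs 'a' are distinct raw values
# that collapse to the same label after normalisation.
sample = pd.DataFrame({
    'cat': [1, '1', ' a', 'a', None],
    'status': [1, 0, 1, 0, 1],
})

props = proportions_by_level(sample, 'cat', 'status')
print(props)
```

For the join to line up, the dummy columns would need to be built from the same normalised key. That part is not shown here.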

How to fix ValueError: cannot reindex from a duplicate axis in Python

0 answers: