Question

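I have a data frame with character variables. My task is to compute the relative frequency of the values in each variable and to flag every value whose relative frequency is below some threshold (as a binary value in a corresponding new flag variable).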

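So far I have tried the code below. It works for one variable, but I am not sure how to do this in a loop, or whether there is a better, more efficient solution to my problem.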


I also tried it on my real dataset, but it shows NaN (I now know why, though).

import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Alice ', 'Barbara', 'Carol', 'Henry','ds','sed'],
        'Sex' : ['M','F','F','F','M','f','m'],
        'Age' : [14,13,13,14,12,13,14],
        'Weight': [69.0, 56.5, 65.3, 62.8, 65.3,67,69],
        'Height': [112.5, 84.0, 98.0, 102.5, 102.5,101,105.3]}

cl =  pd.DataFrame(data)

# this is just to test on char variables 
cl1=cl.drop(['Age','Height','Weight'],axis=1).copy()

x=(cl.Sex.value_counts()/cl.shape[0]*100).to_frame().reset_index()
x.columns = ['Sex', 'Freq']

pd.merge(cl, x, on='Sex', how ='left')

I need a flag variable for each char variable in the output dataset, e.g. cl would have sex_flag and name_flag, or age_flag (which I treat as a char variable).

cat_data is equivalent to cl1 in the code above. The desired output would look like:

The AGE_freq column needs to be dropped afterwards.

Answer 1

import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Alice ', 'Barbara', 'Carol', 'Henry','ds','sed'],
        'Sex' : ['M','F','F','F','M','f','m'],
        'Age' : [14,13,13,14,12,13,14],
        'Weight': [69.0, 56.5, 65.3, 62.8, 65.3,67,69],
        'Height': [112.5, 84.0, 98.0, 102.5, 102.5,101,105.3]}

cl =  pd.DataFrame(data)
req_df=cl.copy()

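col1=cl.columns[0]
cols=cl.columns[1:]

# Attach a relative-frequency column for every column after the first.
for col in cols:
    temp_df=cl[[col1,col]]
    # Share of rows taken by each distinct value of `col`.
    x=temp_df[temp_df.columns[-1]].value_counts()/cl.shape[0]
    x=x.to_frame().reset_index()
    x.columns = [col, 'Freq'+str(col)]
    # The left merge keeps every original row and adds the frequency column.
    req_df=pd.merge(req_df, x, on=col, how ='left')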
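The loop above stops at the frequency columns. A minimal sketch of how the flags could be derived from them follows. It assumes a cutoff of 0.25 (a fraction of rows, not a value given in the question), runs over every column of a two-column stand-in frame, and drops each helper frequency column once its flag exists:

```python
import pandas as pd

# Stand-in for the character columns of `cl`.
cl = pd.DataFrame({
    'Name': ['Alice', 'Alice ', 'Barbara', 'Carol', 'Henry', 'ds', 'sed'],
    'Sex': ['M', 'F', 'F', 'F', 'M', 'f', 'm'],
})
threshold = 0.25  # assumed cutoff (fraction of rows), not from the question

req_df = cl.copy()
for col in cl.columns:
    # Same frequency table as in the loop above, one column at a time.
    x = (cl[col].value_counts() / cl.shape[0]).to_frame().reset_index()
    x.columns = [col, 'Freq' + col]
    req_df = pd.merge(req_df, x, on=col, how='left')
    # 1 where the value's share of rows is below the threshold, else 0.
    req_df[col + '_flag'] = (req_df['Freq' + col] < threshold).astype(int)
    # The helper frequency column is no longer needed.
    req_df = req_df.drop(columns='Freq' + col)

print(req_df)
```

With this data every name is unique, so all of Name_flag is 1, while only the lowercase 'f' and 'm' rows are flagged in Sex_flag.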

Answer 2

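You need code that computes the relative frequencies and then applies a threshold of 25%. In the code below a flag of 1 marks values whose relative frequency is above 25%, and 0 marks the rest.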

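# Map every value to its relative frequency in percent, column by column.
freq = cl.apply(lambda x: x.map(x.value_counts(normalize=True).mul(100).round(2).to_dict()))
# 1 where the frequency exceeds 25%, otherwise 0.
freq = pd.DataFrame(np.where(freq>25, 1, 0), columns=freq.columns)
freq.columns = [x+'_flag' for x in freq.columns]
# Recent pandas versions require axis to be passed by keyword.
pd.concat([cl, freq], axis=1)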

Output:

       Name Sex  Age  Weight  Height  Name_flag  Sex_flag  Age_flag  Weight_flag  Height_flag
0    Alice   M   14    69.0   112.5          0         1         1            1            0
1   Alice    F   13    56.5    84.0          0         1         1            0            0
2  Barbara   F   13    65.3    98.0          0         1         1            1            0
3    Carol   F   14    62.8   102.5          0         1         1            0            1
4    Henry   M   12    65.3   102.5          0         1         0            1            1
5       ds   f   13    67.0   101.0          0         0         1            0            0
6      sed   m   14    69.0   105.3          0         0         1            1            0
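The output above flags every column, numeric ones included, and uses 1 for values above the threshold. The question asks for the opposite: flags on character variables only, set when a value is rare. A possible sketch of that variant follows. The helper name `flag_rare_values` and its parameters are hypothetical, and treating every non-numeric column as a character variable is an assumption:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def flag_rare_values(df, threshold=25.0, columns=None):
    """Hypothetical helper: add <col>_flag = 1 where a value's relative
    frequency (in percent) is below `threshold`, else 0."""
    if columns is None:
        # Assumption: every non-numeric column counts as a character variable.
        columns = [c for c in df.columns if not is_numeric_dtype(df[c])]
    out = df.copy()
    for col in columns:
        # Percentage of rows taken by each row's value in this column.
        pct = df[col].map(df[col].value_counts(normalize=True)) * 100
        out[col + '_flag'] = (pct < threshold).astype(int)
    return out


cl = pd.DataFrame({
    'Name': ['Alice', 'Alice ', 'Barbara', 'Carol', 'Henry', 'ds', 'sed'],
    'Sex': ['M', 'F', 'F', 'F', 'M', 'f', 'm'],
    'Age': [14, 13, 13, 14, 12, 13, 14],
})

flagged = flag_rare_values(cl)                      # Name and Sex only
with_age = flag_rare_values(cl, columns=['Age'])    # treat Age as categorical
print(flagged)
```

Passing `columns` explicitly covers the case where a numeric column such as Age should be treated as a character variable.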

Flag relative frequencies below a threshold as outliers in Python
