Question

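I'm looking for an alternative way to loop over a DataFrame, for a given list of columns, and replace all the NaN values with specific values. Right now I'm using iterrows, which is extremely slow. Is there an alternative?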

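Basically, I have some columns with NaN values that should only contain 1 or 0. What I want to do is replace the NaN in each row according to the percentage of 1s and 0s that already exist in that column. Say column_T has 30% 1s and 70% 0s: for each iteration I want to pass it through a randint() condition, so that if the draw is below the 30% threshold the row gets a 1, and otherwise it gets a 0. This continues for every row that has a NaN.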

Example df:

Column_T	Column_I
1	1
0	1
nan	nan
1	0
nan	0
0	nan
1	1

import random

for i, row in target_df.iterrows():
    for j in missing_col_list:
        # counts of the 0s and 1s currently in column j (value_counts skips NaN)
        num_missing_obs = target_df[j].value_counts().sort_index()
        chance_for_0s = (num_missing_obs[0] / (num_missing_obs[1] + num_missing_obs[0])) * 100

        # randomly assign 1s and 0s for missing data by calculated chance
        if str(row[j]) == 'nan':
            if random.randint(0, 100) < chance_for_0s:
                target_df.at[i, j] = 0.0
            else:
                target_df.at[i, j] = 1.0

Answer 1

Here is a different approach using pandas.Series.sample:

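for col in df.columns:
    isnan = df[col].isna()
    # draw replacement values from the non-missing entries of the same column;
    # replace=True avoids an error when a column has more NaNs than observed values
    df.loc[isnan, col] = df[col].dropna().sample(isnan.sum(), replace=True).values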
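
If you want each missing cell to be an independent draw against the column's share of 1s, as the question describes, a vectorized sketch along these lines may also work. The helper name fill_binary_nans and its seed parameter are made up for illustration, and it assumes the listed columns hold only 0, 1 and NaN:

```python
import numpy as np
import pandas as pd


def fill_binary_nans(df, cols, seed=None):
    """Fill NaNs in 0/1 columns with random draws that follow
    each column's observed share of 1s. Returns a new DataFrame."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in cols:
        missing = out[col].isna()
        n_missing = int(missing.sum())
        if n_missing == 0 or n_missing == len(out):
            # nothing to fill, or no observed values to estimate from
            continue
        # the mean of a 0/1 column (NaN is skipped) is the share of 1s
        p_one = out[col].mean()
        # one uniform draw per missing cell: 1 if below the threshold, else 0
        draws = (rng.random(n_missing) < p_one).astype(float)
        out.loc[missing, col] = draws
    return out


# the example frame from the question
example = pd.DataFrame({
    "Column_T": [1, 0, np.nan, 1, np.nan, 0, 1],
    "Column_I": [1, 1, np.nan, 0, 0, np.nan, 1],
})
filled = fill_binary_nans(example, ["Column_T", "Column_I"], seed=0)
print(filled)
```

The loop runs once per column rather than once per row, and the proportion is computed once from the observed values, so earlier fills do not shift the threshold for later ones.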
