For starters, here is some artificial data that fits my problem:
import numpy as np
import pandas as pd

vsize = 10  # assumed value; the original post does not define vsize

# First block of columns, on every third index value
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                  columns=["col_{}".format(x) for x in range(10)],
                  index=range(0, vsize * 3, 3))
# Second block of columns, on every second index value
df_2 = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
                    columns=["col_{}".format(x) for x in range(10, 20, 1)],
                    index=range(0, vsize * 2, 2))
# The outer merge leaves NaNs wherever only one block has a row
df = df.merge(df_2, left_index=True, right_index=True, how='outer')
# Class columns, from coarse (tar_1, 2 classes) to fine (tar_4, 16 classes)
df_tar = pd.DataFrame({"tar_1": [np.random.randint(0, 2) for x in range(vsize * 3)],
                       "tar_2": [np.random.randint(0, 4) for x in range(vsize * 3)],
                       "tar_3": [np.random.randint(0, 8) for x in range(vsize * 3)],
                       "tar_4": [np.random.randint(0, 16) for x in range(vsize * 3)]})
df = df.merge(df_tar, left_index=True, right_index=True, how='inner')
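Now, I want to fill the NaN values in each column with the MEDIAN of the non-NaN values in that column, but with noise added to each filled NaN. The MEDIAN should first be computed over the values in that column that belong to the same class, as given first by the tar_4 column. Then, if any NaNs remain in the column (because some tar_4 classes contain only NaNs in that column, so no MEDIAN can be computed), the same operation is repeated on the updated column (with some NaNs already filled by the tar_4 pass), this time grouping the values by class with respect to the tar_3 column. Then tar_2 and tar_1.
The way I imagine it is as follows:
The same logic applies to the remaining columns.
So, the expected output: a dataframe in which the NaN values of each column are filled based on the decreasing granularity of the classes given by the tar_4 to tar_1 columns.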
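The cascade described above can be sketched on a toy frame, leaving the noise out so that only the fallback from finer to coarser classes is visible. This is a minimal sketch, not the code from this post: the helper name `cascade_median_fill` and the toy data are hypothetical, and it uses `groupby(...).transform("median")` so that each row receives the median of its own class.

```python
import numpy as np
import pandas as pd


def cascade_median_fill(frame, value_cols, class_cols):
    """Fill NaNs with class medians, trying class_cols in the given order.

    Hypothetical helper for illustration; no noise is added here.
    """
    out = frame.copy()
    for cls in class_cols:
        # One median per row, aligned to out's index; classes that hold
        # only NaNs yield NaN and are left for the next, coarser level.
        group_medians = out.groupby(cls)[value_cols].transform("median")
        out[value_cols] = out[value_cols].fillna(group_medians)
    return out


toy = pd.DataFrame({
    "col_0": [1.0, np.nan, 3.0, np.nan, np.nan, 10.0],
    "tar_2": [0, 0, 0, 1, 1, 2],  # finer classes
    "tar_1": [0, 0, 0, 0, 0, 1],  # coarser classes
})
filled = cascade_median_fill(toy, ["col_0"], ["tar_2", "tar_1"])
print(filled)
```

In this toy frame, class 1 of tar_2 contains only NaNs, so those two rows are filled only in the tar_1 pass, using the median of the column as updated by the tar_2 pass.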
Thanks to @Quang Hoang, I already have code that achieves this:
def min_max_check(col):
    # Columns whose values all lie in [0, 1]: plain median, no noise
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        return medians[col]
    # Non-negative columns: only add non-negative noise
    elif (df[col].dropna() >= 0).all():
        return medians[col] + round(np.random.randint(low = 0, high = 3) * stds[col], 2)
    else:
        return medians[col] + round(np.random.randint(low = -3, high = 3) * stds[col], 2)

tar_list = ['tar_4', 'tar_3', 'tar_2', 'tar_1']
cols = [col for col in df.columns if col not in tar_list]
# since your dataframe may not have continuous index
idx = df.index

for tar in tar_list:
    medians = df[cols].groupby(by = df[tar]).agg('median')
    # named stds (not std) so that it matches the name used in min_max_check
    stds = df[cols].groupby(by = df[tar]).agg('std')
    df.set_index(tar, inplace=True)
    for col in cols:
        df[col] = df[col].fillna(min_max_check(col))
    df.reset_index(inplace=True)

df.index = idx
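However, this only fills the NaN values with the same MEDIAN value + noise at each granularity level. How can this code be enhanced so that a varying fill value is generated for each NaN value at, for example, the tar_4, tar_3, tar_2 and tar_1 levels?
Answer 0 (score: 0)
A quick solution is to change min_max_check into gen_noise, which draws noise for each row: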
def gen_noise(col):
    num_row = len(df)

    # generate noise of the same height as our dataset
    # notice the size argument in randint
    if ((df[col].dropna() >= 0) & (df[col].dropna() <= 1.0)).all():
        noise = 0
    elif (df[col].dropna() >= 0).all():
        noise = np.random.randint(low = 0,
                                  high = 3,
                                  size=num_row)
    else:
        noise = np.random.randint(low = -3,
                                  high = 3,
                                  size=num_row)

    # multiplication with isna() forces those at non-null values in df[col] to be 0
    return noise * df[col].isna()
And later:
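    df.set_index(tar, inplace=True)
    # note: cols[:1] restricts the fill to the first column;
    # iterate over cols instead to fill every column
    for col in cols[:1]:
        noise = gen_noise(col)
        df[col] = (df[col].fillna(medians[col])
                          .add(noise.mul(stds[col]).values)
                  )
    df.reset_index(inplace=True)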
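For comparison, here is a self-contained sketch of the same idea that avoids the set_index/reset_index round trip. It is not the code above but a variant under stated assumptions: the names `noise_bounds` and `fill_with_noisy_median`, the `seed` parameter and the toy data are all hypothetical, per-row medians and standard deviations come from `groupby(...).transform`, and a missing standard deviation (a class with a single observed value) is treated as zero noise.

```python
import numpy as np
import pandas as pd


def noise_bounds(series):
    """Return (low, high) for the integer multiplier, or None for no noise.

    Mirrors the three cases used in this thread (hypothetical helper).
    """
    observed = series.dropna()
    if ((observed >= 0) & (observed <= 1.0)).all():
        return None          # values in [0, 1]: plain median
    if (observed >= 0).all():
        return (0, 3)        # non-negative column: multipliers 0, 1, 2
    return (-3, 3)           # otherwise: multipliers -3 .. 2


def fill_with_noisy_median(frame, value_cols, class_cols, seed=None):
    rng = np.random.default_rng(seed)
    out = frame.copy()
    for cls in class_cols:
        grouped = out.groupby(cls)[value_cols]
        medians = grouped.transform("median")        # one median per row
        stds = grouped.transform("std").fillna(0.0)  # assumption: no std -> no noise
        for col in value_cols:
            bounds = noise_bounds(out[col])
            if bounds is None:
                candidates = medians[col]
            else:
                # a separate multiplier for every row, hence for every NaN
                multipliers = rng.integers(bounds[0], bounds[1], size=len(out))
                candidates = medians[col] + (stds[col] * multipliers).round(2)
            # only NaN cells take a candidate; observed values stay untouched
            out[col] = out[col].fillna(candidates)
    return out


toy = pd.DataFrame({
    "a": [1.0, 5.0, np.nan, np.nan, 9.0, np.nan],
    "p": [0.2, np.nan, 0.4, 0.6, np.nan, 1.0],
    "tar_2": [0, 0, 0, 1, 1, 1],
    "tar_1": [0, 0, 0, 0, 0, 0],
})
res = fill_with_noisy_median(toy, ["a", "p"], ["tar_2", "tar_1"], seed=0)
print(res)
```

Because `fillna` is given a Series aligned to the row index, each NaN gets its own draw, while non-NaN cells are never modified.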
Note: you can modify your code further by generating a noise_df with the same size as medians and stds:
for tar in tar_list:
    medians = df[cols].groupby(df[tar]).agg('median')
    stds = df[cols].groupby(df[tar]).agg('std')
    # generate noise_df here
    medians = medians + round(noise_df * stds, 2)
    df.set_index(tar, inplace=True)
    for col in cols[:1]:
        df[col] = df[col].fillna(medians[col])
    df.reset_index(inplace=True)
df.index = idx
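One possible way to build such a noise_df is sketched below. The small `medians` and `stds` tables are made-up stand-ins for the per-class tables computed inside the loop, and the multiplier range -3 to 2 is taken from the `randint(low = -3, high = 3)` call used earlier. Note that this draws one value per class and column, so NaNs in the same class still share a fill value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up stand-ins for the per-class tables computed inside the loop
medians = pd.DataFrame({"col_0": [10.0, 20.0], "col_1": [1.0, 2.0]}, index=[0, 1])
stds = pd.DataFrame({"col_0": [2.0, 4.0], "col_1": [0.5, 0.25]}, index=[0, 1])

# One integer multiplier per (class, column) cell, labelled like medians
noise_df = pd.DataFrame(
    rng.integers(-3, 3, size=medians.shape),
    index=medians.index,
    columns=medians.columns,
)

# Shift each class median by a multiple of that class's standard deviation
noisy_medians = medians + (noise_df * stds).round(2)
print(noisy_medians)
```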