I have two PySpark DataFrames in my environment:
df
y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12
12 rf 22 34 32 54 54 21 43 544 545 332
12 ed 23 34 23 53 23 23 22 434 342 432
.. .. .. .. .. .. .. .. .. ... ... ...
df_filtered
y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12
12 rf 35 34 32 54 54 21 43 544 545 332
12 ed 99 34 23 53 23 23 22 434 342 432
.. .. .. .. .. .. .. .. .. ... ... ...
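y3 is a timestamp column, and my problem statement is:
1. For a given y1, select a timestamp (y3) from df_filtered.
2. Extract the 10 values before that timestamp from df.
3. Compute the quantile range for y7 through y12.
4. Check whether the df_filtered values fall within the quantile range.
5. If any one value falls outside the range, mark that row in df_filtered as outlier, otherwise as real.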
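
To make the steps above concrete, here is a minimal pure-Python sketch of the rule for a single row. The helper names quantile and classify, the window parameter, and the sample rows are assumptions for illustration only (the third history row and both test rows are made up). The quantile uses linear interpolation, which I believe matches the pandas default.

```python
def quantile(values, q):
    """Linearly interpolated quantile of a non-empty list of numbers."""
    s = sorted(values)
    pos = (len(s) - 1) * q          # fractional position in the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def classify(row, history, value_cols, low=0.25, high=0.75, window=10):
    """Return 'real' or 'outlier' for one df_filtered row given as a dict."""
    # Steps 1-2: history rows with the same y1 and y3 in [row.y3 - window, row.y3)
    past = [h for h in history
            if h["y1"] == row["y1"]
            and row["y3"] - window <= h["y3"] < row["y3"]]
    if not past:
        # Assumption: with no history the row cannot be confirmed as real
        return "outlier"
    # Steps 3-5: every column must lie inside its own [low, high] quantile range
    for col in value_cols:
        vals = [h[col] for h in past]
        if not quantile(vals, low) <= row[col] <= quantile(vals, high):
            return "outlier"
    return "real"


# Hypothetical sample data, reduced to two value columns
history = [
    {"y1": 12, "y3": 22, "y7": 54, "y8": 21},
    {"y1": 12, "y3": 23, "y7": 23, "y8": 23},
    {"y1": 12, "y3": 24, "y7": 40, "y8": 22},
]
row_a = {"y1": 12, "y3": 30, "y7": 40, "y8": 22}
row_b = {"y1": 12, "y3": 30, "y7": 99, "y8": 22}

status_a = classify(row_a, history, ["y7", "y8"])
status_b = classify(row_b, history, ["y7", "y8"])
print(status_a, status_b)
```

Note that, like the y3 comparison in my pandas loop, this treats "the 10 values before" as a window of 10 units on y3 rather than the last 10 rows.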
Expected output
y1 y2 y3 y4 y5 y6 y7 y8 y9 y10 y11 y12 status
12 rf 22 34 32 54 54 21 43 544 545 332 real
12 ed 23 34 23 53 23 23 22 434 342 432 outlier
.. .. .. .. .. .. .. .. .. ... ... ... ......
If I treat the data as pandas DataFrames, the following for loop works:
low = 0.25
high = 0.75
list1 = []
for i in range(len(df_filtered)):
    row = df_filtered.iloc[i]  # the df_filtered row being checked
    # rows of df with the same y1 whose y3 lies in [row.y3 - 10, row.y3)
    x = df[(df['y3'] >= row['y3'] - 10) & (df['y3'] < row['y3']) & (df['y1'] == row['y1'])]
    # numeric_only=True skips the text column y2 (pandas 2.x does not drop it automatically)
    y = x.quantile([low, high], numeric_only=True)
    y = y.drop(['y1', 'y3', 'y4', 'y5', 'y6'], axis=1)
    t1 = []  # one verdict per column; reset once per row, not once per column
    for j in y.columns.values:
        t = "real" if y[j][low] <= row[j] <= y[j][high] else "outlier"
        t1.append(t)
    # a single out-of-range column is enough to flag the whole row
    t2 = 'outlier' if 'outlier' in t1 else 'real'
    list1.append(t2)
df_filtered['status'] = list1
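
As a possible step toward a loop-free version, the same logic can be sketched in pandas as a join followed by a grouped quantile. This is only a sketch under assumptions: flag_outliers, row_id, and the y3_hist suffix are hypothetical names, the sample frames are made up, and the merge materialises every pair of rows sharing a y1, which may be too large for big inputs.

```python
import pandas as pd


def flag_outliers(df, df_filtered, value_cols, low=0.25, high=0.75, window=10):
    """Return a copy of df_filtered with a 'status' column (sketch only)."""
    f = df_filtered.reset_index(drop=True)
    f["row_id"] = f.index  # hypothetical key identifying each row to check

    # Pair every row to check with every history row that has the same y1
    pairs = f[["row_id", "y1", "y3"]].merge(
        df[["y1", "y3"] + value_cols], on="y1", suffixes=("", "_hist")
    )
    # Keep only history inside the window [y3 - window, y3)
    in_window = (pairs["y3_hist"] >= pairs["y3"] - window) & (pairs["y3_hist"] < pairs["y3"])
    pairs = pairs[in_window]

    # Per-row quantile bounds; rows without history get NaN bounds
    grouped = pairs.groupby("row_id")[value_cols]
    lower = grouped.quantile(low).reindex(f.index)
    upper = grouped.quantile(high).reindex(f.index)

    # Comparisons against NaN are False, so rows without history become outliers
    inside = f[value_cols].ge(lower) & f[value_cols].le(upper)
    f["status"] = inside.all(axis=1).map({True: "real", False: "outlier"})
    return f.drop(columns="row_id")


# Hypothetical sample data, reduced to two value columns
df = pd.DataFrame({"y1": [12, 12, 12], "y3": [22, 23, 24],
                   "y7": [54, 23, 40], "y8": [21, 23, 22]})
df_filtered = pd.DataFrame({"y1": [12, 12], "y3": [30, 30],
                            "y7": [40, 99], "y8": [22, 22]})

result = flag_outliers(df, df_filtered, ["y7", "y8"])
print(result)
```

This join-then-aggregate shape may map more naturally onto PySpark than the row-by-row loop, though I have not verified that.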