Suppose I have some made-up data:
import numpy as np
import pandas as pd

np.random.seed(11)  # added for reproducibility
rows, cols = 50000, 2
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')  # one row per minute
df = pd.DataFrame(data, columns=['Temperature', 'Value'], index=tidx)
mediany = df.Value.median()
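How can I filter df so that whole days are either kept or dropped? For example, for each day, if that day's mean of Value is less than mediany (computed over the Value column of the entire dataset), the entire day should be discarded.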
I thought I could simply filter out every row below that threshold, but that does not preserve the whole days I need:
df = df[(df[['Value']] >= mediany).all(axis=1)]
df
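To make the problem concrete, here is a minimal sketch of what the row-wise filter does to each day. It uses a smaller frame of three full days (an assumption made for brevity, not the 50000 rows above), and the names row_filtered, rows_per_day_before and rows_per_day_after are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(11)
rows = 3 * 1440  # three full days of minute data (smaller than the original frame)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')
df = pd.DataFrame({'Value': np.random.rand(rows)}, index=tidx)
mediany = df['Value'].median()

# Row-wise filter: keeps individual minutes, not whole days
row_filtered = df[df['Value'] >= mediany]

# Count the rows that belong to each calendar day before and after filtering
rows_per_day_before = df.groupby(df.index.normalize()).size()
rows_per_day_after = row_filtered.groupby(row_filtered.index.normalize()).size()
print(rows_per_day_before)
print(rows_per_day_after)
```

Every day is still present after the filter, but each one has lost roughly half of its minutes, which is why a per-day decision is needed instead.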
Any hints are appreciated!
Answer 0 (score: 1)
You can use groupby().transform:
s = (df['Value'].ge(mediany) # compare to mediany
.groupby(df.index.normalize()) # groupby day
.transform('any') # any time with value larger than median
)
df[s]
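Note that the snippet above keeps a day as soon as any single reading reaches mediany. If the intent is to compare each day's mean with the threshold, as the question describes, one possible variant is to broadcast the daily mean with transform('mean'). This is a sketch under that reading of the question; the names day, daily_mean and kept are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(11)
rows, cols = 50000, 2
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')
df = pd.DataFrame(data, columns=['Temperature', 'Value'], index=tidx)
mediany = df['Value'].median()

# Calendar day of every row (timestamp truncated to midnight)
day = df.index.normalize()

# Mean of Value per day, broadcast back so each row carries its day's mean
daily_mean = df['Value'].groupby(day).transform('mean')

# Keep only rows whose day has a mean at or above the threshold
kept = df[daily_mean >= mediany]
```

Because the mask is constant within a day, each day is either kept in full or dropped in full.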
P.S.: The question talks about the mean of the whole dataset but computes the median. The median is not the mean :-)
Answer 1 (score: 0)
You can try the following code. I've added comments:
import numpy as np
import pandas as pd

np.random.seed(11)  # added for reproducibility
rows, cols = 50000, 2
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')
df = pd.DataFrame(data, columns=['Temperature', 'Value'], index=tidx)

# calendar day of each row (timestamp truncated to midnight)
day = df.index.normalize()

# calculate the day-wise mean of Value
df_day_mean = df.groupby(day).agg({'Value': 'mean'})

# calculate the mean over the whole dataset
mean = df.Value.mean()

# get the days whose average Value is less than the overall mean
df_to_discard = df_day_mean[df_day_mean["Value"] < mean]
index_to_discard = df_to_discard.index

# drop every row that falls on one of those days from the original df
filtered_df = df[~day.isin(index_to_discard)]
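The same per-day decision can probably be written more compactly with groupby().filter, which keeps or drops each group as a whole. The following is a sketch rather than part of the answer above; overall_mean and filtered are illustrative names.

```python
import numpy as np
import pandas as pd

np.random.seed(11)
rows, cols = 50000, 2
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')
df = pd.DataFrame(data, columns=['Temperature', 'Value'], index=tidx)

# Threshold: mean of Value over the whole dataset
overall_mean = df['Value'].mean()

# filter() calls the function once per day and keeps the day's rows only if it returns True
filtered = df.groupby(df.index.normalize()).filter(
    lambda g: g['Value'].mean() >= overall_mean
)
```

filter() is convenient to read, but it calls a Python function per group, so on frames with many days a transform-based mask may be faster.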