Question

Suppose I have some made-up data:

import numpy as np
import pandas as pd

np.random.seed(11)  # added for reproducibility

rows, cols = 50000, 2
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')  # one row per minute
df = pd.DataFrame(data, columns=['Temperature', 'Value'], index=tidx)

mediany = df.Value.median()

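How would I filter df by discarding whole days of data? For example: for each day of data, if the average of Value is less than the average of the entire data set (Value), mediany, discard that day.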

I think I can filter out every row that is below the average, but that does not keep whole days of data, which is what I need.

df = df[(df[['Value']] >= mediany).all(axis=1)]

df

Any tips are greatly appreciated!

Answer 1

You can use groupby().transform:

s = (df['Value'].ge(mediany)            # compare to mediany
        .groupby(df.index.normalize())  # groupby day 
        .transform('any')               # any time with value larger than median
    )

df[s]
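Note that 'any' keeps a day as soon as a single row in it reaches the threshold. If the goal is the criterion stated in the question (keep a day only when its mean Value is at least the threshold), a minimal sketch could look like the following. The tiny three-day frame and the names `threshold`, `daily_mean` and `kept` are invented for illustration and are not part of the original post.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): three full days at minute resolution,
# with a constant Value per day so the outcome is easy to follow.
tidx = pd.date_range('2019-01-01', periods=3 * 1440, freq='min')
df = pd.DataFrame({'Value': np.repeat([0.2, 0.8, 0.5], 1440)}, index=tidx)

threshold = df['Value'].median()  # plays the role of `mediany` in the question

# Broadcast each day's mean back onto every row of that day
daily_mean = df['Value'].groupby(df.index.normalize()).transform('mean')

# Keep only the rows whose day-level mean reaches the threshold
kept = df[daily_mean.ge(threshold)]
```

Here the first day (mean 0.2) is dropped in full, while the other two days are kept with all of their rows.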

P.S.: you describe mediany as the average of the whole data set, but it is the median, and the median is not the mean :-)
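To make the difference concrete, here is a small sketch on a made-up, skewed sample (the numbers are invented for this example): a single large value pulls the mean far away from the median.

```python
import numpy as np

# Invented sample with one outlier
values = np.array([1, 2, 3, 4, 100])

mean_value = values.mean()        # 22.0, pulled up by the outlier
median_value = np.median(values)  # 3.0, the middle value
```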

Answer 2

You can try the following code. I have added comments:

import numpy as np
import pandas as pd

np.random.seed(11)  # added for reproducibility

rows, cols = 50000, 2
data = np.random.rand(rows, cols)
tidx = pd.date_range('2019-01-01', periods=rows, freq='min')  # one row per minute
df = pd.DataFrame(data, columns=['Temperature', 'Value'], index=tidx)

# calculate the day-wise mean
# (group by calendar day; grouping by df.index itself would give one group per timestamp)

days = df.index.normalize()
df_day_mean = df.groupby(days)[['Value']].mean()


# calculate the whole mean

mean = df.Value.mean()

# get the days where the average value is less than the whole mean

df_to_discard = df_day_mean[df_day_mean["Value"] < mean]
index_to_discard = df_to_discard.index

# drop every row that belongs to one of these days from the original df
# (df.drop(index_to_discard) would only remove the midnight rows)

filtered_df = df[~days.isin(index_to_discard)]
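The same day-level filtering can also be sketched with groupby().filter, which calls a function once per group and keeps or drops the whole group. This is only one possible variant, not necessarily what the answer intended; the smaller five-day frame is an assumption made to keep the example quick.

```python
import numpy as np
import pandas as pd

np.random.seed(11)

# Illustrative data (assumed): five full days at minute resolution
tidx = pd.date_range('2019-01-01', periods=5 * 1440, freq='min')
df = pd.DataFrame({'Value': np.random.rand(len(tidx))}, index=tidx)

mean = df['Value'].mean()

# The lambda receives one day's rows at a time; returning True keeps
# every row of that day, returning False drops the whole day.
filtered_df = df.groupby(df.index.normalize()).filter(
    lambda day: day['Value'].mean() >= mean
)
```

filter is convenient to read, but it runs a Python function per group, so on large frames a boolean mask built from a transform or from isin is usually faster.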

Pandas: filter out whole days of a data set based on a value
