How do I trim outliers from dates in Python?

Time: 2019-06-25 09:28:02

Tags: python date trim

I have a DataFrame df:

0    2003-01-02
1    2015-10-31
2    2015-11-01
16   2015-11-02
33   2015-11-03
44   2015-11-04

and I want to trim the outliers in the dates. So in this example, I want to delete the row with the date 2003-01-02. Or, in larger DataFrames, I want to delete the dates that do not lie in the interval containing 95% or 99% of the dates. Is there a function that can do this?

2 answers:

Answer 0: (score: 0)

Assuming you have converted the column to datetime format:

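import pandas as pd

# Sample dates from the question, as strings
data = ['2003-01-02', '2015-10-31', '2015-11-01',
        '2015-11-02', '2015-11-03', '2015-11-04']
df = pd.DataFrame(data)
# Note: after this line df is a Series of datetime64 values, not a DataFrame
df = pd.to_datetime(df[0])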

You can do this:

include = df[df.dt.year > 2003]
print(include)

[out]:
1   2015-10-31
2   2015-11-01
3   2015-11-02
4   2015-11-03
5   2015-11-04
Name: 0, dtype: datetime64[ns]

Take a look here

...regarding your answer (it is basically the same idea... get creative, my friend):

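# df is already a Series of datetimes, so quantile() can be called on it directly
s10 = df.quantile(.10)
s90 = df.quantile(.90)

# This compares calendar years only, so the cut is made at year granularity
my_filtered_data = df[df.dt.year >= s10.year]
my_filtered_data = my_filtered_data[my_filtered_data.dt.year <= s90.year]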
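Because the snippet above compares years only, every 2015 date survives, including one that lies above the 90% quantile. As a rough sketch (not the answerer's code), the same idea can be applied to the full timestamps with Series.between. The variable names here are illustrative, and the sample dates are the ones from the question:

```python
import pandas as pd

# Sample dates taken from the question
dates = pd.Series(pd.to_datetime([
    "2003-01-02", "2015-10-31", "2015-11-01",
    "2015-11-02", "2015-11-03", "2015-11-04",
]))

# Quantiles of a datetime Series are themselves timestamps
low = dates.quantile(0.10)
high = dates.quantile(0.90)

# Keep only the dates inside [low, high]; both ends are inclusive by default
trimmed = dates[dates.between(low, high)]
print(trimmed)
```

On this sample, the year-based filter keeps five rows, while the timestamp-based filter also drops 2015-11-04 because it falls after the 90% cutoff.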

Answer 1: (score: 0)

You can use quantile() on a Series or DataFrame.

import datetime
import pandas as pd

dates = [datetime.date(2003,1,2),
         datetime.date(2015,10,31),
         datetime.date(2015,11,1),
         datetime.date(2015,11,2),
         datetime.date(2015,11,3),
         datetime.date(2015,11,4)]
df = pd.DataFrame({'DATE': [pd.Timestamp(x) for x in dates]})
print(df)

qa = df['DATE'].quantile(0.1)  # 10th percentile (lower cutoff)
qb = df['DATE'].quantile(0.9)  # 90th percentile (upper cutoff)

print(qa, qb)

#remove outliers
xf = df[(df['DATE'] >= qa) & (df['DATE'] <= qb)]
print(xf)

The output is:

        DATE
0 2003-01-02
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
5 2015-11-04
2009-06-01 12:00:00 2015-11-03 12:00:00
        DATE
1 2015-10-31
2 2015-11-01
3 2015-11-02
4 2015-11-03
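
The question asks about keeping the dates that fall inside a central 95% or 99% interval. One way this could be wrapped up is sketched below. The helper name trim_date_outliers and its coverage parameter are hypothetical, not part of pandas; the sketch only assumes that the column already holds datetime values:

```python
import pandas as pd


def trim_date_outliers(df, column, coverage=0.95):
    """Hypothetical helper: keep rows whose date lies in the central
    `coverage` share of the column, cutting equal tails on both sides."""
    tail = (1.0 - coverage) / 2.0
    lower = df[column].quantile(tail)
    upper = df[column].quantile(1.0 - tail)
    # between() is inclusive on both ends by default
    return df[df[column].between(lower, upper)]


# Sample dates taken from the question
df = pd.DataFrame({"DATE": pd.to_datetime([
    "2003-01-02", "2015-10-31", "2015-11-01",
    "2015-11-02", "2015-11-03", "2015-11-04",
])})

# coverage=0.80 roughly corresponds to the 10% / 90% cutoffs used above
trimmed = trim_date_outliers(df, "DATE", coverage=0.80)
print(trimmed)
```

Note that with distinct dates, any coverage below 1 removes at least the earliest and the latest row, so on a six-row sample like this one the result is the same for 0.80, 0.95, or 0.99. The choice of coverage only starts to matter on larger DataFrames.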