Filter out troughs based on the distance between peaks

Time: 2019-04-21 16:41:53

Tags: python pandas numpy scipy

I have the following dataframe:

date    Values
3/1/2018    
3/3/2018    0
3/5/2018    -0.011630952
3/8/2018    0.024635792
3/10/2018   
3/10/2018   0.013662755
3/13/2018   2.563770771
3/15/2018   0.026081264
3/17/2018   
3/25/2018   4.890818119
3/26/2018   
3/28/2018   0.994944572
3/30/2018   0.098569691
4/2/2018    
4/2/2018    2.261398315
4/4/2018    2.595984459
4/7/2018    2.145072699
4/9/2018    2.401818037
4/11/2018   
4/12/2018   2.233839989
4/14/2018   2.179880142
4/17/2018   0.173141539
4/18/2018   
4/19/2018   0.04037559
4/22/2018   2.813424349
4/24/2018   2.764060259
4/27/2018   
5/2/2018    4.12789917
5/4/2018    4.282546997
5/4/2018    
5/7/2018    5.083333015
5/13/2018   
5/14/2018   1.615991831
5/17/2018   0.250209153
5/19/2018   5.003758907
5/20/2018   
5/22/2018   
5/24/2018   0.177665412
5/29/2018   
6/1/2018    3.190019131
6/3/2018    3.514900446
6/5/2018    2.796386003
6/6/2018    4.132686615
6/8/2018    
6/11/2018   2.82530117
6/14/2018   
6/16/2018   1.786619782
6/18/2018   
6/21/2018   1.60535562
6/21/2018   1.737388611
6/23/2018   0.048161745
6/26/2018   1.811254263
6/28/2018   0.109187543
6/30/2018   
7/1/2018    0.086753845
7/3/2018    2.141263962
7/6/2018    1.116563678
7/7/2018    1.159829378
7/8/2018    0.107431769
7/11/2018   -0.001963556
7/13/2018   
7/16/2018   
7/16/2018   0.071490705
7/18/2018   1.052834034
7/21/2018   
7/23/2018   
7/23/2018   1.201774001
7/28/2018   0.218167484
7/31/2018   0.504413128
8/1/2018    
8/2/2018    
8/5/2018    1.057194233
8/7/2018    0.85014987
8/8/2018    1.183927178
8/10/2018   1.226516366
8/12/2018   1.533656836
8/15/2018   
8/17/2018   
8/17/2018   1.355006456
8/20/2018   1.490438223
8/22/2018   
8/24/2018   1.160542369
8/25/2018   1.546550632
8/27/2018   
8/30/2018   

Plotted, it looks like this:

enter image description here

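I want to filter out all troughs between peaks if the distance between the peaks is less than 14 days. For example, I want to filter out the low values between the peaks at 5/7/2018 and 5/19/2018 and replace those values with NaNs. There are a lot of scipy filters that help with smoothing, but I am not sure how to remove troughs based on the condition I specified. The output should look something like this (if we fit a curve after removing the troughs):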

enter image description here

Based on @Asmus' suggestion: I expect the final result to have a single peak, so a Gaussian distribution might be best (with emphasis on "might").

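2 Answers:

Answer 0: (score: 5)

Important note: since this answer was already quite long, I decided to rewrite it completely instead of updating it for a fifth time. If you are interested in the "historical context", have a look at the version history.


First, run some required imports: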

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib import gridspec
import matplotlib as mpl
mpl.style.use('seaborn-paper') ## for nicer looking plots only; named 'seaborn-v0_8-paper' in matplotlib >= 3.6

from lmfit import fit_report
from lmfit.models import GaussianModel, BreitWignerModel

Then clean up the data (saved as a .csv file, as given above):

df=pd.read_csv('pdFilterDates.txt',sep=r'\s+') ## data as given above; sep=r'\s+' replaces the deprecated delim_whitespace=True
df['date'] = pd.to_datetime(df['date'],format = '%m/%d/%Y')

## initial cleanup
df=df.dropna() ## clean initial NA values, e.g. 3/10/2018

## there is a duplicate at datetime.date(2018, 6, 21):
# print(df['date'][df.date.duplicated()].dt.date.values) 
df=df.groupby('date').mean().reset_index() ## so we take a mean value here
# print(df['date'][df.date.duplicated()].dt.date.values) ## see, no more duplicates

df = df.set_index('date',drop=False) ## now we can set date as index

And reindex to a daily frequency:

complete_date_range_idx = pd.date_range(df.index.min(), df.index.max(),freq='D') 
df_filled=df.reindex(complete_date_range_idx, fill_value=np.nan).reset_index()
## obtain index values, which can be understood as time delta in days from the start

idx=df_filled.index.values ## len: 176; this will be used again, in the end

## now we obtain (x,y) on basis of idx
not_na=pd.notna(df_filled['Values'])
x=idx[not_na]     ## only the positions (days from start) that carry a value
y=df_filled['Values'][not_na].values

### let's write over the original df
df=df_filled 
#####
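
As a minimal illustration of why the reindexed integer index can be read as "days since the first date", consider this sketch on three made-up points (the names sparse and daily are hypothetical, not part of the answer):

```python
import numpy as np
import pandas as pd

# three toy observations on irregular dates
sparse = pd.DataFrame(
    {"Values": [0.0, 2.5, 1.0]},
    index=pd.to_datetime(["2018-03-03", "2018-03-05", "2018-03-08"]),
)

# one row per calendar day; days without an observation become NaN
daily_idx = pd.date_range(sparse.index.min(), sparse.index.max(), freq="D")
daily = sparse.reindex(daily_idx).reset_index()

idx = daily.index.to_numpy()                 # 0, 1, 2, ... = days since the first date
not_na = daily["Values"].notna().to_numpy()
x = idx[not_na]                              # day offsets that carry a value
y = daily.loc[not_na, "Values"].to_numpy()

print(x, y)
```

Because the new index is a plain integer range over consecutive days, x can be passed to a fit directly as a numeric axis, and idx can later be used to evaluate the fit for every day.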

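Now for the interesting part: fit the data with some asymmetric line shape (Breit-Wigner-Fano) and remove the "outliers" that lie below a certain threshold. We do this first by manually declaring where the peak should be (our initial guess, which removes 3 points), then do it again using that fit (fit 1) as input (which removes 8 points), and finally obtain the final fit.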
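
A rough sketch of this two-pass idea, not the answer's actual code: it swaps in a plain Gaussian fitted with scipy.optimize.curve_fit in place of lmfit's BreitWignerModel, and the 50 % cutoff (fraction), the helper names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(x, amplitude, center, sigma):
    """Symmetric stand-in for the asymmetric line shape used in the answer."""
    return amplitude * np.exp(-0.5 * ((x - center) / sigma) ** 2)


def iterative_fit(x, y, initial_guess, fraction=0.5, rounds=2):
    """Drop points far below a reference curve, refit, and repeat.

    Round 1 uses the hand-picked initial guess as the reference curve,
    later rounds use the previous fit. `fraction` is a hypothetical cutoff.
    """
    params = np.asarray(initial_guess, dtype=float)
    for _ in range(rounds):
        reference = gaussian(x, *params)
        keep = y >= fraction * reference      # points below the cutoff count as outliers
        x, y = x[keep], y[keep]
        params, _ = curve_fit(gaussian, x, y, p0=params)
    return params, x, y


# synthetic data: one broad peak sampled every 3 days, with three troughs
x = np.arange(0.0, 176.0, 3.0)
y = gaussian(x, 4.0, 80.0, 30.0)
trough_positions = [20, 25, 30]               # x = 60, 75, 90
y[trough_positions] = 0.1

params, x_kept, y_kept = iterative_fit(x, y, initial_guess=(3.0, 82.0, 28.0))
print(params)                                 # fitted amplitude, center, sigma
print(len(x) - len(x_kept), "points removed")
```

The cutoff is relative to the reference curve, so low values in the tails survive while low values under the peak are dropped. With real, noisy data the threshold and the number of rounds would need tuning by eye.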

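As requested, we can now interpolate the fit onto the daily index created earlier (bwf_result_final.eval(x=idx)) and create additional columns in the dataframe: y_fine, which holds only the fitted values; y_final, which holds the final point cloud (i.e. after outlier removal); and y_joined, a joined dataset (which looks "jagged"). Finally, we can plot everything against the "fine" data range (df['index']).

Figure 1: iteratively removing outliers

Figure 2: cleaned up dataset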
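
One possible way to build the three columns is sketched below. It assumes that y_joined takes the kept data point where one exists and the fitted value everywhere else; fitted_curve is a hypothetical stand-in for bwf_result_final.eval, and all numbers are toy values.

```python
import numpy as np
import pandas as pd

# toy daily frame; "index" holds the dates, as after reset_index() above
df = pd.DataFrame({"index": pd.date_range("2018-03-03", periods=10, freq="D")})
idx = df.index.to_numpy()

# points that survived the outlier removal (hypothetical values)
x_final = np.array([0, 3, 4, 8])
y_final = np.array([0.1, 2.0, 2.4, 0.3])


def fitted_curve(x):
    """Stand-in for evaluating the final fit at positions x."""
    return 2.5 * np.exp(-0.5 * ((x - 4) / 2.0) ** 2)


df["y_fine"] = fitted_curve(idx)                      # fit evaluated for every day
df["y_final"] = np.nan
df.loc[x_final, "y_final"] = y_final                  # cleaned data where available
df["y_joined"] = df["y_final"].fillna(df["y_fine"])   # data first, fit as fallback

print(df)
```

The joined column looks jagged because it alternates between measured points and the smooth curve.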

Answer 1: (score: 1)

Try this:

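import pandas as pd
from scipy.signal import find_peaks

# first find the peaks
# interpolate is important for find_peaks to work
# (note: rel_height only takes effect when width is also passed)
peaks = (find_peaks(df.set_index('date').interpolate()
         .reset_index().Values, rel_height=0.1)[0])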

# copy the peaks' dates for easy manipulation
peak_df = df.loc[peaks, ['date']].copy()

# mark where the peak was too close to the last
markers = (peak_df.date - peak_df.date.shift()).le(pd.Timedelta('14d'))

# filter
# df[markers.notnull()               # where the peaks are
#   | (~markers.bfill().eq(False))] # those between the peaks that are far enough

# as the above code gives an error
markers = ((markers.notnull() | (~markers.bfill().eq(False)))==True).index
df.loc[markers]
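
For comparison, here is a minimal sketch that expresses the question's condition directly: find the peaks, and wherever two consecutive peaks are less than 14 days apart, set every value between them to NaN. The function name, the prominence setting, and the toy values are assumptions made for illustration; the prominence in particular would need tuning on the real data.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks


def mask_troughs_between_close_peaks(df, max_gap="14D", prominence=1.0):
    """Return a cleaned copy of df with values between close peaks set to NaN."""
    out = df.dropna(subset=["Values"]).sort_values("date").reset_index(drop=True)
    peaks, _ = find_peaks(out["Values"].to_numpy(), prominence=prominence)
    for left, right in zip(peaks[:-1], peaks[1:]):
        gap = out.loc[right, "date"] - out.loc[left, "date"]
        if gap < pd.Timedelta(max_gap):
            # .loc slicing is inclusive, so this covers only the rows between the peaks
            out.loc[left + 1 : right - 1, "Values"] = np.nan
    return out


# toy data loosely based on the values around the 5/7/2018 and 5/19/2018 peaks
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-04-24", "2018-05-07", "2018-05-14", "2018-05-17",
        "2018-05-19", "2018-05-24", "2018-06-06", "2018-06-11",
    ]),
    "Values": [0.5, 5.08, 1.62, 0.25, 5.00, 0.18, 4.13, 0.30],
})

result = mask_troughs_between_close_peaks(toy)
print(result)
```

In this toy frame the two peaks 12 days apart lose the values between them, while the trough before the peak 18 days later is kept.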