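I need to compute the standardized mean (z-score) from a monthly time series, but I also need to exclude "incomplete" years (those with fewer than 12 months) from the calculation.
NumPy / SciPy "working" version: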
import numpy as np
import scipy.stats as sts
url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
npdata = np.genfromtxt(url, skip_header=1)
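# np.unique returns the years sorted; iterating over a bare set() has no guaranteed order.
unique_enso_year = np.unique(npdata[:, 0]).astype(int)
nin34 = np.zeros(len(unique_enso_year))
for ind, year in enumerate(unique_enso_year):
    indexes = np.flatnonzero(npdata[:, 0] == year)
    # Average only the years that have all 12 months; mark the rest as NaN.
    if len(indexes) == 12:
        nin34[ind] = np.mean(npdata[indexes, 9])
    else:
        nin34[ind] = np.nan
# sts.nanmean / sts.nanstd were removed from SciPy; the NumPy equivalents are used here.
# ddof=1 reproduces the unbiased default of the old sts.nanstd.
nin34x = (nin34 - np.nanmean(nin34)) / np.nanstd(nin34, ddof=1)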
array([[ 1.02250000e+00, 5.15000000e-01, -6.73333333e-01,
-7.02500000e-01, 1.16666667e-01, 1.32916667e+00,
-1.10333333e+00, -8.11666667e-01, 1.51666667e-01,
6.42500000e-01, 6.49166667e-01, 3.71666667e-01,
4.05000000e-01, -1.98333333e-01, -4.79166667e-01,
1.24666667e+00, -1.44166667e-01, -1.18166667e+00,
-8.89166667e-01, -2.51666667e-01, 7.36666667e-01,
3.02500000e-01, 3.83333333e-01, 1.19166667e-01,
1.70833333e-01, -5.25000000e-01, -7.35000000e-01,
3.75000000e-01, -4.50833333e-01, -8.30000000e-01,
-1.41666667e-02, nan]])
Pandas attempt:
import pandas as pd
from datetime import datetime
def parse(yr, mon):
    date = datetime(year=int(yr), day=2, month=int(mon))
    return date
url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)
grouped = data.groupby(lambda x: x.year)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = grouped.transform(zscore)
print(transformed['ANOM.3'])
YR_MON
1982-01-02 -0.986922
1982-02-02 -1.179216
1982-03-02 -1.179216
1982-04-02 -0.885119
1982-05-02 -0.376105
1982-06-02 0.087664
1982-07-02 -0.161188
1982-08-02 0.098975
1982-09-02 0.415695
1982-10-02 1.049134
1982-11-02 1.286674
1982-12-02 1.829622
1983-01-02 1.715072
1983-02-02 1.428598
1983-03-02 0.976272
...
2012-03-02 -0.999284
2012-04-02 -0.663736
2012-05-02 -0.063283
2012-06-02 0.572491
2012-07-02 0.961020
2012-08-02 1.314227
2012-09-02 0.925699
2012-10-02 0.537170
2012-11-02 0.660793
2012-12-02 -0.169245
2013-01-02 -1.001483
2013-02-02 -0.924445
2013-03-02 0.462223
2013-04-02 1.386668
2013-05-02 0.077037
Name: ANOM.3, Length: 377, dtype: float64
This is not what I want, because it also counts 2013 (which has only 5 months).
To extract what I want I need to do something like:
annual = grouped.mean()['ANOM.3'][:-1]  # drop the last (incomplete) year
# Series.mean() and Series.std() skip NaN, and std() uses ddof=1 like the old sts.nanstd.
(annual - annual.mean()) / annual.std()
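But this assumes I already know that the last year is incomplete, and I also lose the np.nan that I should get as the 2013 value.
So now I'm trying to do a query-like selection in pandas: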
grouped2 = data.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None).reset_index(drop=True)
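This gives me the "right values", but it produces a new DataFrame without the timestamp index. I'm sure there is a simple and elegant way to do this. Thanks for any help!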
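The requirement described above (annual means, NaN for any year with fewer than 12 months, then standardize across years) can be sketched roughly as follows. This is only an illustration on a synthetic series, not on the NOAA file: the dates, the values, and the names anom, by_year, annual_mean and zscores are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (an assumption, standing in for the ANOM.3 column):
# 2011 and 2012 are complete, 2013 has only 5 months.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
values = np.concatenate([np.full(12, 1.0), np.full(12, 3.0), np.full(5, 10.0)])
anom = pd.Series(values, index=idx, name="ANOM.3")

by_year = anom.groupby(anom.index.year)
# Annual mean, masked to NaN wherever the year has fewer than 12 months.
annual_mean = by_year.mean().where(by_year.count() == 12)
# Series.mean() and Series.std() skip NaN; std() uses ddof=1 by default.
zscores = (annual_mean - annual_mean.mean()) / annual_mean.std()
print(zscores)
```

The incomplete year stays in the result as NaN, so there is no need to know in advance which year is short.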
Answer 0 (score: 0)
import pandas as pd
url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
# parse() is the date helper defined in the question.
ts_raw = pd.read_table(url,
                       sep=' ',
                       header=0,
                       skiprows=0,
                       parse_dates=[['YR', 'MON']],
                       skipinitialspace=True,
                       index_col=0,
                       date_parser=parse)
# Keep only the years that have all 12 months.
ts_year_group = ts_raw.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None)
# Rebuild a monthly DatetimeIndex spanning the retained rows.
ts_range = pd.date_range(ts_year_group.index[0][1],
                         ts_year_group.index[-1][1] + pd.DateOffset(months=1),
                         freq="M")
ts = pd.DataFrame(ts_year_group.values,
                  index=ts_range,
                  columns=ts_year_group.keys())
ts_fullyears_group = ts.groupby(lambda x: x.year)
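# Use the full-years grouping built above (the original line referenced `grouped`, which is not defined in this answer).
annual = ts_fullyears_group.mean()['ANOM.3']
# Series.mean() and Series.std() skip NaN, and std() uses ddof=1 like the old sts.nanstd.
nin_anomalies = (annual - annual.mean()) / annual.std()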
nin_anomalies
1982 1.527215
1983 0.779877
1984 -0.970047
1985 -1.012997
1986 0.193297
1987 1.978809
1988 -1.603259
1989 -1.173755
1990 0.244837
1991 0.967632
1992 0.977449
1993 0.568807
1994 0.617893
1995 -0.270568
1996 -0.684120
1997 1.857320
1998 -0.190803
1999 -1.718612
2000 -1.287880
2001 -0.349106
2002 1.106301
2003 0.466953
2004 0.585987
2005 0.196978
2006 0.273062
2007 -0.751613
2008 -1.060856
2009 0.573715
2010 -0.642396
2011 -1.200752
2012 0.000633
Name: ANOM.3, dtype: float64
I'm sure there is a better way to do the same thing :/
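One possible simplification, offered as a sketch rather than a tested replacement: groupby(...).filter(...) drops whole groups while keeping the original rows and their DatetimeIndex, so the index would not need to be rebuilt with date_range. The example runs on a small synthetic frame (an assumption, not the NOAA data); only the names ts_raw, ts and nin_anomalies are borrowed from the code above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the parsed table: Jan 2011 to May 2013,
# i.e. two complete years followed by a 5-month year.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
ts_raw = pd.DataFrame({"ANOM.3": np.linspace(-1.0, 1.0, 29)}, index=idx)

# filter() keeps every row of the groups for which the function returns True,
# and the surviving rows keep their original timestamps.
ts = ts_raw.groupby(ts_raw.index.year).filter(lambda sdf: len(sdf) == 12)

# Annual means of the complete years, standardized across years.
annual = ts.groupby(ts.index.year)["ANOM.3"].mean()
nin_anomalies = (annual - annual.mean()) / annual.std()
print(nin_anomalies)
```

Unlike the apply/None approach, the result here is still indexed by timestamp, at the cost of dropping the incomplete year entirely instead of reporting it as NaN.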
Answer 1 (score: 0)
Here is one solution. It gets a bit nasty in places because your dates fall on the 2nd of each month.
To start:
In [205]: import pandas as pd
In [206]: from datetime import datetime
In [207]: from datetime import timedelta
In [208]:
In [208]: def parse(yr, mon):
   .....:     date = datetime(year=int(yr), day=2, month=int(mon))
   .....:     return date
   .....:
In [209]:
In [209]: url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
In [210]: data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)
In [211]: grouped = data.groupby(lambda x: x.year)
Find the full years:
In [212]: full_year = grouped['NINO1+2'].count() == 12
In [213]: full_year
Out[213]:
1982 True
1983 True
1984 True
1985 True
1986 True
1987 True
1988 True
1989 True
1990 True
1991 True
1992 True
1993 True
1994 True
1995 True
1996 True
1997 True
1998 True
1999 True
2000 True
2001 True
2002 True
2003 True
2004 True
2005 True
2006 True
2007 True
2008 True
2009 True
2010 True
2011 True
2012 True
2013 False
dtype: bool
Now we have to get the index into the right dtype and aligned with the data. This could probably be simplified a bit:
In [214]: strt = data.index[0] - timedelta(1)
In [215]: idx = pd.DatetimeIndex(start=strt, periods=len(full_year - 1), freq='BA-JAN')
In [216]: idx = idx + timedelta(1) # Get to 2nd of each month
In [232]: idx
Out[232]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1982-01-02 00:00:00, ..., 2013-01-02 00:00:00]
Length: 32, Freq: None, Timezone: None
In [233]: full_year.index = idx
This is the key step:
In [234]: full_year = full_year.reindex_like(data, method='ffill')
Hopefully this is right:
In [235]: data.ix[full_year].tail()
Out[235]:
NINO1+2 ANOM NINO3 ANOM.1 NINO4 ANOM.2 NINO3.4 ANOM.3 \
YR_MON
2012-08-02 20.99 0.35 25.72 0.73 29.10 0.42 27.55 0.73
2012-09-02 20.83 0.49 25.28 0.43 29.12 0.43 27.24 0.51
2012-10-02 20.68 -0.11 24.93 0.01 29.16 0.50 26.98 0.29
2012-11-02 21.21 -0.38 25.11 0.14 29.17 0.54 27.01 0.36
2012-12-02 22.13 -0.68 24.91 -0.23 28.71 0.23 26.46 -0.11
Unnamed: 10
YR_MON
2012-08-02 NaN
2012-09-02 NaN
2012-10-02 NaN
2012-11-02 NaN
2012-12-02 NaN
From here, just work with data.ix[full_year] and you should be fine.
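As a possible shortcut for the index alignment above, a per-row mask can be built directly with a grouped transform, which broadcasts each year's row count back onto the monthly index. This is a minimal sketch on synthetic data (the dates and values are assumptions, and months_in_year and complete are names introduced only for the example):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the table: two complete years plus 5 months of a third.
idx = pd.date_range("2011-01-01", periods=29, freq="MS")
data = pd.DataFrame({"ANOM.3": np.arange(29, dtype=float)}, index=idx)

# transform("count") returns one value per original row: the number of
# non-null observations in that row's calendar year.
months_in_year = data.groupby(data.index.year)["ANOM.3"].transform("count")
full_year = months_in_year == 12

# Boolean indexing keeps only rows that belong to complete years.
complete = data[full_year]
print(complete.tail())
```

Because full_year already shares the monthly index of data, no reindexing or forward fill is needed before selecting the rows.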