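This is an extension of an earlier question.
I want to use Pandas and Python to iterate through my .csv files, group the data by season (and year), and calculate the mean for each season of each year. Currently, the quarterly script groups Jan-Mar, Apr-Jun, and so on. I would like the seasons to map to months as follows: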
- 11: 'Winter', 12: 'Winter', 1: 'Winter', 2: 'Spring', 3: 'Spring', 4: 'Spring', 5: 'Summer', 6: 'Summer', 7: 'Summer', \
8: 'Autumn', 9: 'Autumn', 10: 'Autumn'
I have the following data:
Date,HAD
01/01/1951,1
02/01/1951,-0.13161201
03/01/1951,-0.271796132
04/01/1951,-0.258977158
05/01/1951,-0.198823057
06/01/1951,0.167794502
07/01/1951,0.046093808
08/01/1951,-0.122396694
09/01/1951,-0.121824587
10/01/1951,-0.013002463
...
all the way to
20/12/2098,62.817
21/12/2098,59.998
22/12/2098,50.871
23/12/2098,88.405
24/12/2098,81.154
25/12/2098,83.617
26/12/2098,120.675
27/12/2098,273.795
28/12/2098,316.324
29/12/2098,260.951
30/12/2098,198.505
31/12/2098,150.755
This is the code from the earlier question:
import pandas as pd
import os
import re

lookup = {
    11: 'Winter',
    12: 'Winter',
    1: 'Winter',
    2: 'Spring',
    3: 'Spring',
    4: 'Spring',
    5: 'Summer',
    6: 'Summer',
    7: 'Summer',
    8: 'Autumn',
    9: 'Autumn',
    10: 'Autumn'
}

os.chdir('C:/Users/n-jones/testdir/output/')

for fname in os.listdir('.'):
    if re.match(".*csv$", fname):
        data = pd.read_csv(fname, parse_dates=[0], dayfirst=True)
        data['Season'] = data['Date'].apply(lambda x: lookup[x.month])
        data['count'] = 1
        # Select several columns with a list (double brackets)
        data = data.groupby(['Season'])[['HAD', 'count']].sum()
        data['mean'] = data['HAD'] / data['count']
        data.to_csv('C:/Users/n-jones/testdir/season/' + fname)
I would like the output csv file to be:
Autumn 1951, Mean, Winter 1951/52, Mean, Spring 1952, Mean, Summer 1952, Mean,
Autumn 1952, Mean, Winter 1952/53, Mean, Spring 1953, Mean, Summer 1953, Mean,
and so on...
I hope this makes sense.
Thanks in advance!
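Answer 0 (score: 2)
This is a great example of a case where itertools.groupby is your best friend!
Forgive me for not extending your code, but I'm not very familiar with pandas, so I chose to use the csv module instead.
Once you have written two functions to group the data by (get_season and get_year), you simply iterate over the groups and write the results to a new csv file.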
import csv
from datetime import datetime
from itertools import groupby

LOOKUP_SEASON = {
    11: 'Winter',
    12: 'Winter',
    1: 'Winter',
    2: 'Spring',
    3: 'Spring',
    4: 'Spring',
    5: 'Summer',
    6: 'Summer',
    7: 'Summer',
    8: 'Autumn',
    9: 'Autumn',
    10: 'Autumn'
}


def get_season(row):
    date = datetime.strptime(row[0], '%d/%m/%Y')
    season = LOOKUP_SEASON[date.month]
    if season == 'Winter':
        if date.month == 1:
            last_year, next_year = date.year - 1, date.year
        else:
            last_year, next_year = date.year, date.year + 1
        return '{} {}/{}'.format(season, last_year, next_year)
    else:
        return '{} {}'.format(season, date.year)


def get_year(row):
    date = datetime.strptime(row[0], '%d/%m/%Y')
    if date.month < 8:
        return date.year - 1
    else:
        return date.year

# On Python 3 the csv module expects text-mode files opened with newline=''
with open('NJDATA.csv', newline='') as data_file, \
        open('outfile.csv', 'w', newline='') as out_file:
    headers = next(data_file)  # skip the header line
    reader = csv.reader(data_file)
    writer = csv.writer(out_file)
    # Loop over groups distinguished by the "year" from Autumn->Summer,
    # defined by the `get_year` function
    for year, seasons in groupby(reader, get_year):
        mean_data = []
        # Loop over the data in the current year, grouped by season, defined
        # by the get_season method. Since the required "season string"
        # (e.g Autumn 1952) can be used as an identifier for the seasons,
        # the `get_season` method returns the specific string which is used
        # in the output, so you don't have to compile that one more time
        # inside the for loops
        for season_str, iter_data in groupby(seasons, get_season):
            data = list(iter_data)
            mean = sum(float(row[1]) for row in data) / len(data)
            # Use the next line instead if you want to control the precision
            #mean = '{:.3f}'.format(sum(float(row[1]) for row in data) / len(data))
            mean_data.extend([season_str, mean])
        # One output row per Autumn->Summer year
        writer.writerow(mean_data)
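The basic idea here is to first group the data by year (Autumn -> Summer), and then group each of those groups again by season. The groupby function takes two arguments: a sequence and a function. It iterates over the sequence, and every time the return value of the provided function changes, the data seen so far is treated as a separate group.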
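A minimal sketch of that behaviour, using made-up rows rather than the real file. The key function `month_of` is hypothetical and only for illustration. Note that `itertools.groupby` only compares consecutive items, so the rows need to be in date order already; a key that reappears later starts a new group.

```python
from itertools import groupby

# Made-up (date string, value) rows, already in date order
rows = [
    ('01/01/1951', 1.0),
    ('02/01/1951', 3.0),
    ('01/02/1951', 5.0),
    ('01/01/1952', 7.0),
]


def month_of(row):
    # Hypothetical key function: the month part of a dd/mm/yyyy string
    return row[0].split('/')[1]


# A new group starts whenever month_of() returns a value different
# from the one it returned for the previous row
groups = [(key, [value for _, value in group])
          for key, group in groupby(rows, month_of)]
print(groups)
```

Here January 1951 and January 1952 end up in two separate groups even though they share the key '01', which is why the answer's code groups by year first and by season second.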
Consider this sample data:
01/01/1951,1
02/01/1951,-0.13161201
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132
04/06/1951,-0.258977158
05/06/1951,-0.198823057
06/08/1951,0.167794502
...
09/02/1952,-0.121824587
The first groupby call groups the data according to your definition of a year (defined in get_year), and gives the following groups of data:
# get_year returns 1950
01/01/1951,1
...
05/06/1951,-0.198823057
# get_year returns 1951
06/08/1951,0.167794502
...
09/02/1952,-0.121824587
The next groupby call groups each of the above groups by season (defined in get_season). Let's consider the first group:
# get_season returns 'Winter 1950/1951'
01/01/1951,1
02/01/1951,-0.13161201
# get_season returns 'Spring 1951'
01/04/1951,1
02/04/1951,-0.13161201
03/04/1951,-0.271796132
# get_season returns 'Summer 1951'
04/06/1951,-0.258977158
05/06/1951,-0.198823057
Answer 1 (score: 1)
Here is a simple solution:
import pandas as pd

# `lookup` is the month -> season dict defined in the question


def year_and_season(x):
    season = lookup[x.month]
    year = x.year
    if x.month == 12:
        year += 1
    return (year, season)


data = pd.read_csv('example.csv', index_col=0, parse_dates=[0], dayfirst=True)
yearsAndSeason = data.groupby(year_and_season).mean()
yearsAndSeason.to_csv('results.csv')
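Note that the date is set as the index column when reading the file, so its fields can be accessed directly in the function passed to groupby. That function returns a tuple containing the year and the season. You can also call mean directly instead of sum.
results.csv doesn't look quite like what you expect, because the keys are printed as tuples, but you can probably work that part out. This is how it looks for me...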
$ cat results.csv
,Mean
"(1951, 'Winter')",0.009545620900000005
"(2099, 'Winter')",145.65558333333334
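One possible way to get readable keys instead of tuples is to have the grouping function return the label itself. The sketch below is only an illustration under a few assumptions: `season_label` is a hypothetical helper, the sample rows are made up, and the labels follow the 'Winter 1951/52' style from the question. Unlike `year_and_season` above, it rolls November as well as December into the following winter, because the question's mapping counts both as winter months.

```python
import io
import pandas as pd

lookup = {11: 'Winter', 12: 'Winter', 1: 'Winter',
          2: 'Spring', 3: 'Spring', 4: 'Spring',
          5: 'Summer', 6: 'Summer', 7: 'Summer',
          8: 'Autumn', 9: 'Autumn', 10: 'Autumn'}


def season_label(ts):
    # Hypothetical helper: builds a label such as 'Winter 1951/52'
    season = lookup[ts.month]
    if season != 'Winter':
        return '{} {}'.format(season, ts.year)
    # November and December belong to the winter that ends next year
    start = ts.year if ts.month >= 11 else ts.year - 1
    return 'Winter {}/{:02d}'.format(start, (start + 1) % 100)


# Made-up sample in the same layout as the question's file
csv_text = """Date,HAD
15/10/1951,2.0
15/11/1951,1.0
15/12/1951,2.0
15/01/1952,3.0
15/02/1952,10.0
"""
data = pd.read_csv(io.StringIO(csv_text), index_col=0,
                   parse_dates=[0], dayfirst=True)

# The function is applied to each index value (a Timestamp);
# sort=False keeps the groups in order of first appearance
means = data.groupby(season_label, sort=False)['HAD'].mean()
print(means)
```

With the real file, `io.StringIO(csv_text)` would be replaced by the file name, and `means.to_csv(...)` would write one row per season.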
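Answer 2 (score: 1)
I had the same problem and found that it can be solved with the resample method alone, using the argument 3M (3 months).
I found this thanks to the following site, which gives an example related to the question: http://earthpy.org/time_series_analysis_with_pandas_part_2.html
If the index of your DataFrame consists of pandas datetime objects, all you need to do is ask for a resample every 3 months.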
In [108]:
data.head()
Out[108]:
Sample Measurement
mean
Date Local
2006-01-01 50.820833
2006-01-02 41.900000
2006-01-03 45.870833
2006-01-04 50.850000
2006-01-05 37.116667
In [109]:
# skip the first 88 rows so that the resampling begins in March
# (on pandas 0.18 and later, resample needs an explicit .mean())
wm = data[88:].resample('3M', closed='left').mean()
wm.head()
Out[109]:
Sample Measurement
mean
Date Local
2006-05-31 7.153622
2006-08-31 5.883025
2006-11-30 11.619724
2007-02-28 21.105789
2007-05-31 8.105313
This is on my dataset of daily values. I do lose the first three months of data, but I think it's a very simple way to work with seasons.
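A possible variant that avoids dropping the first rows, sketched here on made-up data rather than the dataset above: pandas has anchored quarterly aliases, and 'QS-NOV' produces bins starting on 1 November, 1 February, 1 May and 1 August, which happens to match the season boundaries in the question. The series below is synthetic (each day's value is just its month number) so that the bin means are easy to check by hand.

```python
import pandas as pd

# Made-up daily series for 2006 whose value is simply the month number
index = pd.date_range('2006-01-01', '2006-12-31', freq='D')
series = pd.Series(index.month.to_numpy(), index=index, dtype=float)

# 'QS-NOV': quarterly bins that start on 1 Nov, 1 Feb, 1 May and 1 Aug;
# each bin is labelled with its start date
seasonal = series.resample('QS-NOV').mean()
print(seasonal)
```

The first and last bins only cover part of a winter here (January alone, then November and December), so their means are based on fewer days than a full season.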