Show monthly columns by year with pandas

Date: 2016-03-14 11:27:38

Tags: python pandas

I want to plot global surface temperature as a time series using the NASA GISS data. The data is organized by year, month, and season.

I want to display it as a time series running from January 1880 to February 2016, with one value per month.

Read in the data and code the NA values

import pandas as pd
data = pd.read_csv("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.csv",
na_values = ["**** ","***  "])

Select the necessary data

# .ix has been removed from pandas; .iloc selects the same columns by position
df = data.iloc[:, 1:19].copy()

Add the year column

df['Year'] = data[' Year']

I tried making a pivot table by year, but that just reproduces the original data frame.

table = pd.pivot_table(df, index = df['Year'], values=['Jan','Feb', 'Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec'])

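I want a data frame with a single column of values indexed by year, with 12 values (one per month) for each year. I thought a pivot table would do this, but I can't see where I'm going wrong.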
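A likely explanation for the pivot table looking like the original frame: pivot_table groups rows by the index key and aggregates each value column (mean by default), and here every year appears exactly once. The sketch below uses a tiny made-up frame (the `wide` name and the numbers are illustrative, not the GISS data) to show that the shape does not change.

```python
import pandas as pd

# Hypothetical stand-in for the GISS table: one row per year, one column per month.
wide = pd.DataFrame({
    "Year": [1880, 1881],
    "Jan": [-29, -9],
    "Feb": [-20, -13],
    "Mar": [-18, 2],
})

# pivot_table groups by Year and averages each month column within the group.
# Each year occurs once, so every "mean" is simply the original value.
table = pd.pivot_table(wide, index="Year", values=["Jan", "Feb", "Mar"])

print(table.shape)  # still one row per year and one column per month
```

So the months stay as columns; turning them into rows needs a reshape such as melt or stack rather than an aggregation.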

1 answer:

Answer 0 (score: 4)

I think you can use melt, together with rename for the Year column:

import pandas as pd
data = pd.read_csv("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.csv",
na_values = ["**** ","***  "])
print(data.head())

df1 = pd.melt(data, id_vars=[' Year'], value_vars=['Jan','Feb', 'Mar','Apr','May','Jun',
'Jul','Aug','Sep','Oct','Nov','Dec'], var_name='Month').rename(columns={' Year':'Year'})

print(df1.columns)
Index([u'Year', u'Month', u'value'], dtype='object')

print(df1)
       Year Month  value
0      1880   Jan    -29
1      1881   Jan     -9
2      1882   Jan     10
3      1883   Jan    -33
4      1884   Jan    -17
5      1885   Jan    -63
6      1886   Jan    -40
7      1887   Jan    -64
8      1888   Jan    -42
9      1889   Jan    -18
10     1890   Jan    -46
11     1891   Jan    -44
12     1892   Jan    -25
13     1893   Jan    -66
14     1894   Jan    -53
15     1895   Jan    -42
16     1896   Jan    -22
17     1897   Jan    -21
18     1898   Jan     -5
19     1899   Jan    -16
20     1900   Jan    -38
21     1901   Jan    -28
22     1902   Jan    -18
23     1903   Jan    -26
24     1904   Jan    -63
25     1905   Jan    -36
26     1906   Jan    -29
27     1907   Jan    -42
28     1908   Jan    -44
29     1909   Jan    -69
...     ...   ...    ...
1614   1987   Dec     48
1615   1988   Dec     33
1616   1989   Dec     36
1617   1990   Dec     41
1618   1991   Dec     32
1619   1992   Dec     22
1620   1993   Dec     19
1621   1994   Dec     36
1622   1995   Dec     30
1623   1996   Dec     40
1624   1997   Dec     59
1625   1998   Dec     57
1626   1999   Dec     47
1627   2000   Dec     30
1628   2001   Dec     54
1629   2002   Dec     42
1630   2003   Dec     73
1631   2004   Dec     51
1632   2005   Dec     67
1633   2006   Dec     78
1634   2007   Dec     49
1635   2008   Dec     54
1636   2009   Dec     64
1637   2010   Dec     48
1638   2011   Dec     53
1639   2012   Dec     52
1640   2013   Dec     66
1641   2014   Dec     79
1642   2015   Dec    110
1643   2016   Dec    NaN

[1644 rows x 3 columns]
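To make the reshape easier to follow, here is a minimal sketch of the same melt call on a two-year, two-month frame. The frame `wide` and its numbers are made up for illustration; only the call pattern mirrors the one above.

```python
import pandas as pd

# Hypothetical miniature of the CSV; the leading space in " Year" mimics its header.
wide = pd.DataFrame({
    " Year": [1880, 1881],
    "Jan": [-29, -9],
    "Feb": [-20, -13],
})

# melt keeps " Year" as an identifier and turns each month column into rows.
long_df = pd.melt(
    wide,
    id_vars=[" Year"],
    value_vars=["Jan", "Feb"],
    var_name="Month",
).rename(columns={" Year": "Year"})

print(long_df)
```

Note that melt emits all the Jan rows first, then all the Feb rows, which matches the ordering in the output above (every January, then every February, and so on).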

Then you can create a new DatetimeIndex using to_datetime and astype:

df1.index = pd.to_datetime(df1['Year'].astype(str) + df1['Month'], format='%Y%b')
timeserie = df1['value']
print(timeserie.head())
1880-01-01 00:00:00   -29
1881-01-01 00:00:00    -9
1882-01-01 00:00:00    10
1883-01-01 00:00:00   -33
1884-01-01 00:00:00   -17
Name: value, dtype: float64

print(df1.index)
DatetimeIndex(['1880-01-01', '1881-01-01', '1882-01-01', '1883-01-01',
               '1884-01-01', '1885-01-01', '1886-01-01', '1887-01-01',
               '1888-01-01', '1889-01-01',
               ...
               '2007-12-01', '2008-12-01', '2009-12-01', '2010-12-01',
               '2011-12-01', '2012-12-01', '2013-12-01', '2014-12-01',
               '2015-12-01', '2016-12-01'],
              dtype='datetime64[ns]', length=1644, freq=None)
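The index above is grouped by month (every January first), so it is not yet in chronological order. A minimal sketch of one way to fix that, assuming a small made-up frame `long_df` in the same Year/Month/value layout, is to call sort_index on the series:

```python
import pandas as pd

# Hypothetical melted data in the same layout as df1 (values are illustrative).
long_df = pd.DataFrame({
    "Year": [1880, 1881, 1880, 1881],
    "Month": ["Jan", "Jan", "Feb", "Feb"],
    "value": [-29.0, -9.0, -20.0, -13.0],
})

# Build the timestamps from strings like "1880Jan" (%b assumes English month names).
long_df.index = pd.to_datetime(
    long_df["Year"].astype(str) + long_df["Month"], format="%Y%b"
)

# Sort by the index so the values run 1880-01, 1880-02, 1881-01, ...
series = long_df["value"].sort_index()
print(series)
```

With the series sorted, series.plot() should draw it as a continuous monthly line, assuming matplotlib is installed.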

If you need a PeriodIndex, use to_period:

df1.index = pd.to_datetime(df1['Year'].astype(str) + df1['Month'], format='%Y%b')
df1.index = df1.index.to_period('M')
timeserie = df1['value']
print(timeserie.head())
1880-01   -29
1881-01    -9
1882-01    10
1883-01   -33
1884-01   -17
Freq: M, Name: value, dtype: float64

print(df1.index)
PeriodIndex(['1880-01', '1881-01', '1882-01', '1883-01', '1884-01', '1885-01',
             '1886-01', '1887-01', '1888-01', '1889-01',
             ...
             '2007-12', '2008-12', '2009-12', '2010-12', '2011-12', '2012-12',
             '2013-12', '2014-12', '2015-12', '2016-12'],
            dtype='int64', length=1644, freq='M')
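As a small illustration of how the two index types relate, the sketch below converts a few made-up timestamps to monthly periods, sorts them, and converts back. The names `idx`, `s` and `back` are only for this example.

```python
import pandas as pd

# Hypothetical month labels in the same "YearMonth" form used above.
idx = pd.to_datetime(["1880Jan", "1881Jan", "1880Feb"], format="%Y%b")

# Each timestamp becomes the calendar month that contains it.
periods = idx.to_period("M")
s = pd.Series([-29.0, -9.0, -20.0], index=periods).sort_index()
print(s)

# to_timestamp goes the other way; by default it returns the start of each period.
back = s.index.to_timestamp()
print(back)
```

A PeriodIndex reads naturally for monthly means, since each value describes a whole month rather than one instant.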

Alternatively, you can first select the Year column and the month columns by position, then use set_index and stack. Finally, call reset_index and set the column names:
    
import pandas as pd
data = pd.read_csv("http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.csv",
na_values = ["**** ","***  "])
print(data.head())

# .ix has been removed from pandas; .iloc selects the same columns by position
df = data.iloc[:, 0:13]

print(df.columns)
Index([u' Year', u'Jan', u'Feb', u'Mar', u'Apr', u'May', u'Jun', u'Jul',
       u'Aug', u'Sep', u'Oct', u'Nov', u'Dec'],
      dtype='object')

table = df.set_index(' Year').stack().reset_index()
table.columns = ['Year','Month','Value']
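For comparison with melt, here is the same set_index/stack chain on a tiny made-up frame (`wide` and its numbers are illustrative only). Because stack walks the frame row by row, the result already comes out in chronological order.

```python
import pandas as pd

# Hypothetical miniature of the selected columns; " Year" keeps the CSV's leading space.
wide = pd.DataFrame({
    " Year": [1880, 1881],
    "Jan": [-29, -9],
    "Feb": [-20, -13],
})

# Move Year into the index, stack the month columns into rows, then flatten again.
table = wide.set_index(" Year").stack().reset_index()
table.columns = ["Year", "Month", "Value"]

print(table)
```

The rows come out as 1880 Jan, 1880 Feb, 1881 Jan, 1881 Feb, so no extra sort is needed before building a date index.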