Question

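I'm curious what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but my understanding is that if you pass an argument to first and last, e.g. 3, it returns the first 3 months or the first 3 years.

In this case, since I'm not passing any arguments to first and last, what is it actually doing when I resample like this? I know that if I resample by chaining .mean(), I resample all the months into years holding the average value, but what happens when I use last()?

More importantly, why do first() and last() give me different answers here? Numerically I can see that they are not equal.

i.e.: post2008.resample().first() != post2008.resample().last()

TL;DR:
i) What do .first() and .last() do?
ii) What do .first() and .last() do in this instance, when chained to a resample?
iii) Why is .resample().first() != .resample().last()?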

This is the code before the aggregation:

import pandas as pd

# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]

# Print the last 8 rows of post2008
print(post2008.tail(8))

This is the output of print(post2008.tail(8)):

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5

This is the code that resamples and aggregates with last():

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)

This is what yearly looks like for post2008.resample('A').last():

              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5

This is the code that resamples and aggregates with first():

# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)

This is what yearly looks like for post2008.resample('A').first():

              VALUE
DATE               
2008-12-31  14668.4
2009-12-31  14383.9
2010-12-31  14681.1
2011-12-31  15238.4
2012-12-31  15973.9
2013-12-31  16475.4
2014-12-31  17025.2
2015-12-31  17783.6
2016-12-31  18281.6

Answer 1

First, let's create a DataFrame with sample data:

import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                            '2015-04-01', '2015-07-01', '2015-07-01',
                            '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)

The output will be:

            VALUE
2014-07-01   1000
2014-10-01   2000
2015-01-01   3000
2015-04-01   4000
2015-07-01   5000
2015-07-01   6000
2016-01-01   7000
2016-04-01   8000

If we pass e.g. '6M' to df.first (which is not an aggregator, but a DataFrame method), we select the first six months of data, which in the example above covers only two dates:

print(df.first('6M'))

            VALUE
2014-07-01   1000
2014-10-01   2000

Similarly, last returns only the rows that fall within the last six months of data:

print(df.last('6M'))

            VALUE
2016-01-01   7000
2016-04-01   8000
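Newer pandas releases deprecate the offset-based DataFrame.first and DataFrame.last, so it may help to see the same idea written as a plain boolean mask. The following is only a sketch: the six-row frame and the names start_cutoff and end_cutoff are made up for illustration, and the mask selects roughly the same rows as first('6M') and last('6M') without claiming identical behaviour at the boundaries.

```python
import pandas as pd

# Hypothetical sample data, similar in shape to the frame above.
dates = pd.to_datetime(['2014-07-01', '2014-10-01', '2015-01-01',
                        '2015-04-01', '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': [1000, 2000, 3000, 4000, 5000, 6000]}, index=dates)

# Rows that lie less than six months after the first timestamp.
start_cutoff = df.index[0] + pd.DateOffset(months=6)
first_six_months = df.loc[df.index < start_cutoff]

# Rows that lie less than six months before the last timestamp.
end_cutoff = df.index[-1] - pd.DateOffset(months=6)
last_six_months = df.loc[df.index > end_cutoff]

print(first_six_months)
print(last_six_months)
```

Both selections return whole rows of the original frame. Nothing is aggregated, which is the key difference from the Resampler methods of the same name.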

Calling df.first without the required offset argument raises an error:

print(df.first())

TypeError: first() missing 1 required positional argument: 'offset'

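On the other hand, df.resample('Y') returns a Resampler object, which has the aggregation methods first, last, mean, etc. Here they keep only the first (respectively, last) value of each year, instead of e.g. averaging all of the year's values or applying some other kind of aggregation: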

print(df.resample('Y').first())

            VALUE
2014-12-31   1000
2015-12-31   3000  # This is the first of the 4 values from 2015
2016-12-31   7000

print(df.resample('Y').last())

            VALUE
2014-12-31   2000
2015-12-31   6000  # This is the last of the 4 values from 2015
2016-12-31   8000
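To tie this back to the question, here is a small sketch that applies the same aggregations to the last eight rows of the GDP data shown there. The name gdp_tail is made up for this example, and the 'YS' (year start) alias is used on the assumption that it is accepted by both older and newer pandas versions. It changes only the bin labels, not the values.

```python
import pandas as pd

# Last eight rows of the question's GDP data (quarterly observations).
dates = pd.to_datetime(['2014-07-01', '2014-10-01', '2015-01-01', '2015-04-01',
                        '2015-07-01', '2015-10-01', '2016-01-01', '2016-04-01'])
gdp_tail = pd.DataFrame(
    {'VALUE': [17569.4, 17692.2, 17783.6, 17998.3,
               18141.9, 18222.8, 18281.6, 18436.5]},
    index=dates,
)

# One row per year, with three aggregations side by side.
summary = gdp_tail.resample('YS')['VALUE'].agg(['first', 'last', 'mean'])
print(summary)

# Equivalent view: group by calendar year and keep the first/last row of each group.
by_year = gdp_tail.groupby(gdp_tail.index.year)['VALUE'].agg(['first', 'last'])
print(by_year)
```

For 2015, first is the January value (17783.6) and last is the October value (18222.8), the same numbers that appear in the question's two outputs. They differ simply because each year contains several quarterly observations and GDP changes from quarter to quarter. The 2014 row here differs from the question's first() output only because this sketch starts in July 2014.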

As an additional example, consider grouping by a smaller period:

print(df.resample('M').last().head())

             VALUE
2014-07-31  1000.0  # This is the last (and only) value from July, 2014
2014-08-31     NaN  # No data for August, 2014
2014-09-30     NaN  # No data for September, 2014
2014-10-31  2000.0
2014-11-30     NaN  # No data for November, 2014

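Here, any period without values is filled with NaN. Also, for the rows shown above, using first instead of last returns the same values, because each of these months contains at most one value.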
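The two observations above can be checked with a short sketch. The three-row frame below is made up so that every month holds at most one row, and 'MS' (month start) is used for the bins on the assumption that it is accepted across pandas versions.

```python
import pandas as pd

# Hypothetical quarterly data: each calendar month holds at most one row.
dates = pd.to_datetime(['2014-07-01', '2014-10-01', '2015-01-01'])
df = pd.DataFrame({'VALUE': [1000, 2000, 3000]}, index=dates)

monthly = df.resample('MS')  # one bin per month, labelled by the month start
first = monthly.first()
last = monthly.last()

# Months without data come back as NaN in both results.
print(first)
empty_months = int(first['VALUE'].isna().sum())
print(empty_months)

# With at most one row per bin, first and last pick the same row.
print(first.equals(last))
```

If the empty months are not wanted, chaining .dropna() onto either result keeps only the months that actually had data.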

Difference between pandas aggregators .first() and .last()
