How to interpolate between rows of a DataFrame using Python and/or R

Time: 2016-06-04 18:34:54

Tags: python dataframe interpolation panel-data

I have a dataset that looks like this:

image of a dataset

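I imported it into a pandas DataFrame with pandas.read_csv, using the "Year" and "Country" columns as the index. What I need to do is change the time step from every 5 years to every year and interpolate the values, and I really have no idea how to do that. I am learning both R and Python, so help in either language would be highly appreciated.

3 answers:

Answer 0: (score: 6)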

  • If you give your DataFrame a DatetimeIndex, you can take advantage of the df.resample and df.interpolate('time') methods.

  • To make df.index a DatetimeIndex, you might be tempted to use set_index('Year'). However, Year by itself is not unique, since it is repeated for each Country. In order to call resample, we need a unique index. So use df.pivot instead:

    # convert integer years into `datetime64` values
    In [441]: df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')

    In [442]: df.pivot(index='Year', columns='Country')
    Out[442]:
                    Avg1                      Avg2
    Country    Australia Austria Belgium Australia Austria Belgium
    Year
    1950-01-01         0       0       0         0       0       0
    1955-01-01         1       1       1        10      10      10
    1960-01-01         2       2       2        20      20      20
    1965-01-01         3       3       3        30      30      30

  • You can then use df.resample('A').mean() to resample the data with annual frequency. You can think of resample('A') as chopping df up into groups of 1-year intervals. resample returns a DatetimeIndexResampler object whose mean method aggregates the values in each group by taking the mean. Thus mean() returns a DataFrame with one row for every year. Since your original df has one datum every 5 years, most of the 1-year groups are empty, so the mean returns NaN for those years. If your data is consistently spaced at 5-year intervals, you could use .first() or .last() instead of .mean(). They would all return the same result.

    In [438]: df.resample('A').mean()
    Out[438]:
                    Avg1                      Avg2
    Country    Australia Austria Belgium Australia Austria Belgium
    Year
    1950-12-31       0.0     0.0     0.0       0.0     0.0     0.0
    1951-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1952-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1953-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1954-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1955-12-31       1.0     1.0     1.0      10.0    10.0    10.0
    1956-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1957-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1958-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1959-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1960-12-31       2.0     2.0     2.0      20.0    20.0    20.0
    1961-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1962-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1963-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1964-12-31       NaN     NaN     NaN       NaN     NaN     NaN
    1965-12-31       3.0     3.0     3.0      30.0    30.0    30.0

  • Putting the steps together:

    import numpy as np
    import pandas as pd
    
    countries = 'Australia Austria Belgium'.split()
    year = np.arange(1950, 1970, 5)
    df = pd.DataFrame(
        {'Country': np.repeat(countries, len(year)),
         'Year': np.tile(year, len(countries)),
         'Avg1': np.tile(np.arange(len(year)), len(countries)),
         'Avg2': 10*np.tile(np.arange(len(year)), len(countries))})
    df['Year'] = (df['Year'].astype('i8')-1970).view('datetime64[Y]')
    df = df.pivot(index='Year', columns='Country')
    
    df = df.resample('A').mean()
    df = df.interpolate(method='time')
    
    df = df.stack('Country')
    df = df.reset_index()
    df = df.sort_values(by=['Country', 'Year'])
    print(df)
    
  • Then df.interpolate(method='time') linearly interpolates the missing NaN values based on the nearest non-NaN values and their associated datetime index values.

The full script above yields

             Year    Country      Avg1       Avg2
    0  1950-12-31  Australia  0.000000   0.000000
    3  1951-12-31  Australia  0.199890   1.998905
    6  1952-12-31  Australia  0.400329   4.003286
    9  1953-12-31  Australia  0.600219   6.002191
    12 1954-12-31  Australia  0.800110   8.001095
    15 1955-12-31  Australia  1.000000  10.000000
    18 1956-12-31  Australia  1.200328  12.003284
    21 1957-12-31  Australia  1.400109  14.001095
    ...
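
The claim in this answer that .mean(), .first() and .last() agree when every 1-year bin holds at most one row can be checked on a small example. The following is only a sketch under assumptions that are not part of the answer: the toy data is made up, the years are converted with pd.to_datetime instead of the .view call, and the 'YS' (year-start) alias is used instead of 'A' so that the row labels stay on January 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, one value every 5 years.
df = pd.DataFrame({
    'Country': ['Australia'] * 3 + ['Austria'] * 3,
    'Year': [1950, 1955, 1960] * 2,
    'Avg1': [0, 1, 2, 10, 11, 12],
})

# Convert integer years to timestamps (January 1 of each year).
df['Year'] = pd.to_datetime(df['Year'].astype(str), format='%Y')

# One column per country gives a unique DatetimeIndex.
wide = df.pivot(index='Year', columns='Country', values='Avg1')

# 'YS' = year-start frequency; each 1-year bin holds at most one row.
by_mean = wide.resample('YS').mean()
by_first = wide.resample('YS').first()
by_last = wide.resample('YS').last()

# Compare the three aggregations, treating NaN as equal to NaN.
same = (
    np.allclose(by_mean.to_numpy(dtype=float),
                by_first.to_numpy(dtype=float), equal_nan=True)
    and np.allclose(by_mean.to_numpy(dtype=float),
                    by_last.to_numpy(dtype=float), equal_nan=True)
)
print(same)

# Fill the empty years, weighting by the actual time between rows.
filled = by_mean.interpolate(method='time')
print(filled.head(6))
```

Because method='time' weights by elapsed days, leap years make the interpolated values slightly uneven (about 0.1999 rather than exactly 0.2 for the first step), which matches the pattern in the output shown above.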

Answer 1: (score: 1)

This is a tough one, but I think I have it.

Here is an example with a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'country': ['australia', 'australia', 'belgium', 'belgium'],
                   'year': [1980, 1985, 1980, 1985],
                   'data1': [1, 5, 10, 15],
                   'data2': [100, 110, 150, 160]})
df = df.set_index(['country', 'year'])
countries = set(df.index.get_level_values(0))
# Add a row (filled with NaN) for every country/year pair, then interpolate
df = df.reindex([(country, year) for country in countries for year in range(1980, 1986)])
df = df.interpolate()
df = df.reset_index()

For your specific data, assuming every country has data every 5 years from 1950 to 2010 (inclusive), it would be

df = pd.read_csv('path_to_data')
df = df.set_index(['country','year'])
countries = set(df.index.get_level_values(0))
df = df.reindex([(country, year) for country in countries for year in range(1950,2011)])
df = df.interpolate()
df = df.reset_index()

A bit of a tricky problem. I would be interested to see if someone has a better solution.
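
One possible variation, offered only as a sketch and not as part of this answer, is to handle each country separately, so that no value can be interpolated across a country boundary even if the countries cover different year ranges. The column names are the hypothetical ones from the sample DataFrame above, and method='index' is a choice made here so that the spacing of the year values is used.

```python
import pandas as pd

df = pd.DataFrame({'country': ['australia', 'australia', 'belgium', 'belgium'],
                   'year': [1980, 1985, 1980, 1985],
                   'data1': [1, 5, 10, 15],
                   'data2': [100, 110, 150, 160]})

pieces = []
for country, group in df.groupby('country'):
    # Every year between this country's first and last observation.
    years = range(group['year'].min(), group['year'].max() + 1)
    filled = (group.set_index('year')[['data1', 'data2']]
                   .reindex(years)                 # missing years become NaN rows
                   .interpolate(method='index'))   # linear in the year value
    filled.insert(0, 'country', country)
    pieces.append(filled.rename_axis('year').reset_index())

annual = pd.concat(pieces, ignore_index=True)
print(annual)
```

The loop costs a little speed on very large panels, but it keeps each country's series independent.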

Answer 2: (score: 0)

First, reindex the frame. Then use df.apply and Series.interpolate.

Something like:

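import pandas as pd

df = pd.read_csv(r'folder/file.txt')
rows = df.shape[0]
# Space the existing rows 5 apart: 0, 5, 10, ...
df.index = range(0, 5*rows, 5)
# Insert empty (NaN) rows for the steps in between
df = df.reindex(range(0, 5*rows))
# Interpolate each column; assign the result, since apply returns a new frame
df = df.apply(pd.Series.interpolate)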
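
To see what this reindex-then-interpolate idea does without a CSV file, here is a small sketch on made-up data for a single series. The column names are hypothetical, and stopping the new index at the last observed row is a choice made here, not something the answer specifies.

```python
import pandas as pd

# Hypothetical data: three observations taken 5 steps apart.
df = pd.DataFrame({'Avg1': [0.0, 1.0, 2.0], 'Avg2': [0.0, 10.0, 20.0]})
rows = df.shape[0]

# Place the observations at positions 0, 5, 10.
df.index = range(0, 5 * rows, 5)

# Add NaN rows for positions 1-4 and 6-9, stopping at the last observation.
df = df.reindex(range(0, 5 * (rows - 1) + 1))

# Column-by-column interpolation via apply ...
via_apply = df.apply(pd.Series.interpolate)
# ... and the same operation called on the whole frame.
via_frame = df.interpolate()

print(via_apply.equals(via_frame))
print(via_apply)
```

For all-numeric columns the two calls should give the same frame, so df.interpolate() is the shorter spelling. Note that this approach treats the file as one series, so data for several countries would first need to be split or pivoted.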