Below is a sample of a large CSV file that I have:
data = pd.read_csv('C:/Users/Ene_E/Desktop/Data/data.csv')
data.head()
name year value
0 Afghanistan 1800 603
1 Albania 1800 667
2 Algeria 1800 715
3 Andorra 1800 1200
4 Angola 1800 618
data.tail()
name year value
46508 Venezuela 2040 17600
46509 Vietnam 2040 12300
46510 Yemen 2040 3960
46511 Zambia 2040 6590
46512 Zimbabwe 2040 3210
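My large CSV covers more than 200 countries, with one record per country per year from 1800 to 2040. My goal is to resample this data to monthly frequency and interpolate the value column, as shown below. I use Afghanistan in 1800 to illustrate the final result I expect.
Expected output: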
name date value
Afghanistan Jan 1800 start_value
Afghanistan Feb 1800 .
Afghanistan Mar 1800 .
Afghanistan Apr 1800 .
Afghanistan May 1800 .
Afghanistan Jun 1800 .
Afghanistan Jul 1800 .This column is interpolated smoothly
Afghanistan Aug 1800 .
Afghanistan Sep 1800 .
Afghanistan Oct 1800 .
Afghanistan Nov 1800 .
Afghanistan Dec 1800 603(end value in that year)
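I would like to resample all of the data in Python as shown above, because that is the only way my model can work. Note: the dates should be in the format shown above.
I have tried several times without success: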
data['year'] = pd.to_datetime(data.year, format='%Y')
head(data)
Error:
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
head(data)
NameError: name 'head' is not defined
data.head()
name year value
0 Afghanistan 1800-01-01 00:00:00 603
1 Albania 1800-01-01 00:00:00 667
2 Algeria 1800-01-01 00:00:00 715
3 Andorra 1800-01-01 00:00:00 1200
4 Angola 1800-01-01 00:00:00 618
data.resample('1M', how='interpolate')
Error:
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
data.resample('1M', how='interpolate')
File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 8145, in resample
base=base, key=on, level=level)
File "C:\Python27\lib\site-packages\pandas\core\resample.py", line 1251, in resample
return tg._get_resampler(obj, kind=kind)
File "C:\Python27\lib\site-packages\pandas\core\resample.py", line 1381, in _get_resampler
"but got an instance of %r" % type(ax).__name__)
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
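For what it's worth, this TypeError is about the index rather than the values: resample() only works when the frame is indexed by dates, and here the dates are still in the `year` column while the index is a plain RangeIndex. As far as I know, the `how=` argument is also no longer accepted by newer pandas versions. A minimal sketch for a single country, where the 1801 value is made up for illustration and month-start ("MS") frequency is an assumption:

```python
import pandas as pd

# Tiny stand-in for one country's yearly rows (the 1801 value is invented).
yearly = pd.DataFrame({
    "name": ["Afghanistan", "Afghanistan"],
    "year": [1800, 1801],
    "value": [603.0, 615.0],
})

# Turn the integer year into a timestamp and make it the index,
# so resample() sees a DatetimeIndex instead of a RangeIndex.
yearly["year"] = pd.to_datetime(yearly["year"].astype(str), format="%Y")
yearly = yearly.set_index("year")

# asfreq() inserts the missing months as NaN; interpolate() fills them linearly.
monthly = yearly[["value"]].resample("MS").asfreq().interpolate(method="linear")
print(monthly)
```

This covers one country only; with several countries in the same frame the dates repeat, so the resampling has to be done per country.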
data.groupby(name).resample('1M', how='interpolate')
Error:
Traceback (most recent call last):
File "<pyshell#15>", line 1, in <module>
data.groupby(name).resample('1M', how='interpolate')
NameError: name 'name' is not defined
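The NameError here comes from passing the bare identifier `name` instead of the column label as a string, 'name'. Below is a rough sketch of the per-country version: group by the 'name' column, resample each group to month starts, interpolate linearly, then format the dates like "Jan 1800". The helper `to_monthly`, the sample values for 1801, and the choice to pin each yearly value to 1 January are all assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for the real CSV (the 1801 values are invented).
data = pd.DataFrame({
    "name": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "year": [1800, 1800, 1801, 1801],
    "value": [603.0, 667.0, 615.0, 679.0],
})
data["date"] = pd.to_datetime(data["year"].astype(str), format="%Y")

def to_monthly(group):
    """Hypothetical helper: monthly, linearly interpolated values for one country."""
    series = group.set_index("date")["value"]
    return series.resample("MS").asfreq().interpolate(method="linear")

# The column label is passed as a string; one monthly series per country.
pieces = {country: to_monthly(group) for country, group in data.groupby("name")}

# Stack the per-country series back into one long frame.
monthly = pd.concat(pieces, names=["name", "date"]).reset_index()

# Month abbreviation plus year, e.g. "Jan 1800" (month names follow the locale).
monthly["date"] = monthly["date"].dt.strftime("%b %Y")
print(monthly.head(13))
```

If the yearly figure should sit on December rather than January, the dates can be shifted before resampling; the rest of the sketch stays the same.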
Any ideas?
Answer 0: (score: 0)
Use np.where to conditionally assign a label wherever the date value is missing:
data['value']=np.where(data['date'].isna(), 'This column is interpolated smoothly', '')#.data.ffill(axis=0, inplace=True)
Forward-fill the missing dates:
data['date']=pd.to_datetime(data['date']).ffill()
Group by date and reset back to a DataFrame:
data.set_index('date', inplace=True)
data['value'] = np.where( data.index.month== 1, 'start_value', data['value'])
data['value'] = np.where( data.index.month== 12, 'End_value', data['value'])
data.groupby(data.index.month)[['name', 'value']].ffill().reset_index().sort_values(by=['name','date'], ascending=True).drop_duplicates()
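The start/end labelling can also be kept numeric, so the column can actually be interpolated. A small sketch under the assumption that each year's figure belongs to December (as in the expected output), with a made-up 1801 value; the names `yearly`, `full_range` and `monthly` are only illustrative:

```python
import pandas as pd

# Yearly values pinned to December of their year (the 1801 value is invented).
yearly = pd.Series(
    [603.0, 627.0],
    index=pd.to_datetime(["1800-12-01", "1801-12-01"]),
    name="value",
)

# Every month from January of the first year to December of the last year.
full_range = pd.date_range("1800-01-01", "1801-12-01", freq="MS")

# reindex() leaves NaN in the months without data; interpolate() fills the
# gaps linearly, and limit_direction="both" also fills the leading months
# (Jan to Nov 1800) with the first known value.
monthly = yearly.reindex(full_range).interpolate(
    method="linear", limit_direction="both"
)
print(monthly)
```

The first eleven months come out flat because there is no earlier year to interpolate from; whether that is acceptable depends on the model.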
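Answer 1: (score: 0)
@DEVELOPER_ONE I'm not familiar with interpolation or resampling, but I wanted to offer a different approach. I reproduced your desired output literally: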
import pandas as pd
import numpy as np
data = pd.DataFrame({'name': ['Afghanistan', 'Albania', 'Zimbabwe', 'Afghanistan',
                              'Albania', 'Zimbabwe'],
                     'year': [1800, 1800, 1800, 2040, 2040, 2040],
                     'value': [603, 667, 59, 2415, 2804, 3210]
                     })
# Unique years, names and month labels, each as a one-column frame
df_year_unique = pd.DataFrame(data['year'].drop_duplicates().reset_index(drop=True))
df_name_unique = pd.DataFrame(data['name'].drop_duplicates().reset_index(drop=True))
df_month_unique = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                                          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']})
# Repeat each frame so all three have one row per (name, year, month) combination
df_name = pd.DataFrame(pd.concat([df_name_unique]*len(
    df_month_unique)*len(df_year_unique),
    ignore_index=True)).sort_values('name').reset_index(drop=True)
df_month = pd.DataFrame(pd.concat([df_month_unique]*len(
    df_year_unique)*len(df_name_unique),
    ignore_index=True))
df_year = pd.DataFrame(pd.concat([df_year_unique]*len(
    df_month_unique)*len(df_name_unique),
    ignore_index=True)).sort_values('year').reset_index(drop=True)
# Align the repeated frames by position, then attach the yearly values
df_year_month = pd.merge(df_month, df_year, how='inner', left_index=True,
                         right_index=True)
df_year_month_name = pd.merge(df_year_month, df_name, how='inner', left_index=True,
                              right_index=True)
df = pd.merge(df_year_month_name, data, how='left', on=['name', 'year'])
# Overwrite the values with the placeholder text from the expected output
df['value'] = np.where(df['Month'] != 'Dec', '.', df['value'])
df['value'] = np.where(df['Month'] == 'Jan', 'start_value', df['value'])
df['value'] = np.where(df['Month'] == 'Jul', '.This column is interpolated smoothly',
                       df['value'])
df
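The same name × year × month grid can probably be built more compactly with a Cartesian product. This is only a sketch: the sample values for 1801 are invented, and leaving the non-December months as NaN (instead of the '.' placeholders) is my own choice, made so the column stays numeric and could be interpolated afterwards.

```python
import pandas as pd

# Small stand-in for the real data (the 1801 values are invented).
data = pd.DataFrame({
    "name": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "year": [1800, 1800, 1801, 1801],
    "value": [603, 667, 615, 679],
})
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# One row per (name, year, Month) combination, in that nesting order.
grid = pd.MultiIndex.from_product(
    [sorted(data["name"].unique()), sorted(data["year"].unique()), months],
    names=["name", "year", "Month"],
).to_frame(index=False)

# Attach each year's value to all twelve of its months...
df = grid.merge(data, how="left", on=["name", "year"])

# ...then keep it only in December and leave NaN in the other months.
df["value"] = df["value"].where(df["Month"] == "Dec")
print(df.head(12))
```

Because the grid is ordered by name, then year, then month, a per-name interpolation over `value` would then fill the NaN months between consecutive Decembers.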