First, I want to forward-fill the data for each unique Group_Id at a 1S (one-second) frequency, so basically group by Group_Id and then resample with ffill.
Here is the data:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
This is what I did before:
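import pandas as pd

def daily_average_temperature(dfdf):
    # copy the needed columns so the assignment below does not modify a slice
    INDEX = dfdf[['Group_Id','Timestamp','Data']].copy()
    INDEX['Timestamp'] = pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    # resamples the whole frame at once; Group_Id is not used for grouping
    INDEX1 = INDEX.resample('1S').last().ffill()
    return INDEX1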
This is wrong because it does not first group the rows by their Group_Id values; it ignores that column entirely.
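To make the problem concrete, here is a minimal sketch with three made-up readings (the values and timestamps are assumptions for illustration, not rows from the dataset above). Two groups report within the same second, and the frame is resampled as a whole:

```python
import pandas as pd

# Made-up readings: groups 101 and 52 both report during second :05.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-01-01 00:00:05.523",
        "2018-01-01 00:00:05.757",
        "2018-01-01 00:00:07.100",
    ]),
    "Data": [125.5, 127.0, 126.5],
    "Group_Id": [101, 52, 52],
})

# Resampling the whole frame treats all rows as one time series:
# last() keeps a single row per second, whichever group it came from.
mixed = df.set_index("Timestamp").resample("1s").last().ffill()
print(mixed)
```

In this sketch the reading from group 101 is overwritten by group 52's reading in the same second, so group 101 does not appear in the result at all.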
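Second, I want to spread out the data values so that each row is one Group_Id and position-indexed columns replace the Timestamp column, looking like this: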
x0 x1 x2 x3 x4 x5 ... Group_Id
0 40 31.05 25.5 25.5 25.5 25 ... 1
1 35 35.75 36.5 36.5 36.5 36.5 ... 2
2 25.5 25.5 25.5 25.5 25.5 25.5 ... 3
3 25.5 25.5 25.5 25.5 25.5 25.5 ... 4
4 25 25 25 25 25 25 ... 5
⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
Note that the table above is unrelated to the earlier dataset; it only illustrates the format.
Thanks.
Answer 0 (score: 0)
Use DataFrame.groupby together with DataFrameGroupBy.resample:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .ffill()
                .reset_index())
    return dfdf

print(daily_average_temperature(dfdf))
Group_Id Timestamp Data
0 50 2018-01-03 00:00:15 125.5
1 52 2018-01-02 00:00:09 127.0
2 52 2018-01-02 00:00:10 127.0
3 52 2018-01-02 00:00:11 127.0
4 52 2018-01-02 00:00:12 127.0
5 52 2018-01-02 00:00:13 126.5
6 101 2018-01-01 00:00:05 125.0
7 120 2018-01-01 00:00:07 125.5
8 120 2018-01-01 00:00:08 125.0
9 300 2018-01-04 00:00:14 127.0
10 300 2018-01-04 00:00:15 126.5
11 350 2018-01-05 00:00:19 125.5
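Since dfdf is not defined in the answer, the following is a self-contained sketch that rebuilds a small frame from the first five visible rows of the question (the rows hidden behind "..." are unknown and left out) and applies the same groupby and resample chain. The names rows, sample and resampled are mine, and the lowercase 's' alias is used because recent pandas versions prefer it over 'S'.

```python
import pandas as pd

# First five visible rows of the question's data.
rows = [
    ("2018-01-01 00:00:05.523", 125.5, 101),
    ("2018-01-01 00:00:05.757", 125.0, 101),
    ("2018-01-02 00:00:09.507", 127.0, 52),
    ("2018-01-02 00:00:13.743", 126.5, 52),
    ("2018-01-03 00:00:15.407", 125.5, 50),
]
sample = pd.DataFrame(rows, columns=["Timestamp", "Data", "Group_Id"])
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# One resampler per Group_Id: keep the last reading in each second,
# then forward-fill the empty seconds.
resampled = (sample.set_index("Timestamp")
                   .groupby("Group_Id")["Data"]
                   .resample("1s")
                   .last()
                   .ffill()
                   .reset_index())
print(resampled)
```

Note that ffill here runs on the combined result rather than per group. That appears safe in this case because each group's resampled range starts and ends at an observed reading, so there is no leading gap for a neighbouring group's value to leak into.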
Edit: This solution uses the minimal and maximal datetimes to build a new DatetimeIndex in the columns with DataFrame.reindex and date_range after reshaping with Series.unstack; back-filling is also added where necessary:
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    # floor to whole seconds to get the first and last second in the data
    s = dfdf['Timestamp'].dt.floor('S')
    # back-fill runs first (gaps take the next reading), then forward-fill the rest
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range(s.min(), s.max(), freq='S'), axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
            )
    return dfdf

df = daily_average_temperature(dfdf)
print(df['Group_Id'].value_counts())
350 345615
300 345615
120 345615
101 345615
52 345615
50 345615
Name: Group_Id, dtype: int64
Another solution is similar, except that the values for date_range are given as strings (not determined dynamically from min and max):
def daily_average_temperature(dfdf):
    dfdf['Timestamp'] = pd.to_datetime(dfdf['Timestamp'])
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range('2018-01-01', '2018-01-08', freq='S'),
                         axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
            )
    return dfdf

df = daily_average_temperature(dfdf)
print(df['Group_Id'].value_counts())
350 604801
300 604801
120 604801
101 604801
52 604801
50 604801
Name: Group_Id, dtype: int64
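For the second part of the question (one row per Group_Id with columns x0, x1, ...), one possible sketch is to stop after the unstack step instead of stacking back to long form. The data below is made up so that the time grid stays small, and the names sample, grid and wide are my own. This sketch also forward-fills before back-filling, which differs from the order in the code above, so that a gap inside a group carries the previous reading.

```python
import pandas as pd

# Made-up readings for two groups within a four-second window.
rows = [
    ("2018-01-01 00:00:05.523", 125.5, 101),
    ("2018-01-01 00:00:07.100", 125.0, 101),
    ("2018-01-01 00:00:06.507", 127.0, 52),
    ("2018-01-01 00:00:08.743", 126.5, 52),
]
sample = pd.DataFrame(rows, columns=["Timestamp", "Data", "Group_Id"])
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Common one-second grid spanning the whole dataset.
seconds = sample["Timestamp"].dt.floor("s")
grid = pd.date_range(seconds.min(), seconds.max(), freq="s")

wide = (sample.set_index("Timestamp")
              .groupby("Group_Id")["Data"]
              .resample("1s")
              .last()
              .unstack()              # one row per Group_Id, one column per second
              .reindex(grid, axis=1)  # align every group to the same grid
              .ffill(axis=1)          # carry the previous reading forward
              .bfill(axis=1))         # fill seconds before a group's first reading

# Replace the timestamps with positional labels and move Group_Id to the end.
wide.columns = [f"x{i}" for i in range(wide.shape[1])]
wide = wide.reset_index()
wide["Group_Id"] = wide.pop("Group_Id")
print(wide)
```

With this made-up input the result has columns x0 to x3 followed by Group_Id, one row for group 52 and one for group 101.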