How to resample with forward fill for each distinct value in a column

Time: 2019-10-11 04:04:14

Tags: python pandas

First, I want to forward-fill the data to 1-second intervals for each unique value of Group_Id, i.e. group by Group_Id and then resample with ffill.

Here is the data:

    Id           Timestamp         Data    Group_Id    
0    1     2018-01-01 00:00:05.523 125.5   101 
1    2     2018-01-01 00:00:05.757 125.0   101 
2    3     2018-01-02 00:00:09.507 127.0   52  
3    4     2018-01-02 00:00:13.743 126.5   52  
4    5     2018-01-03 00:00:15.407 125.5   50
                    ...

11   11    2018-01-01 00:00:07.523 125.5   120 
12   12    2018-01-01 00:00:08.757 125.0   120 
13   13    2018-01-04 00:00:14.507 127.0   300  
14   14    2018-01-04 00:00:15.743 126.5   300  
15   15    2018-01-05 00:00:19.407 125.5   350

This is what I did before:

def daily_average_temperature(dfdf):
    # work on a copy so the caller's DataFrame is not modified
    INDEX = dfdf[['Group_Id','Timestamp','Data']].copy()
    INDEX['Timestamp'] = pd.to_datetime(INDEX['Timestamp'])
    INDEX = INDEX.set_index('Timestamp')
    # resamples all rows together, without grouping by Group_Id
    INDEX1 = INDEX.resample('1S').last().ffill()

    return INDEX1

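This is wrong because it does not first group the data by the different values of Group_Id; it ignores that column.

Second, I want to spread the Data values so that each row is one Group_Id, with the index replacing the Timestamp column, like this: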

    x0      x1      x2      x3      x4      x5      ...   Group_Id
0   40      31.05   25.5    25.5    25.5    25      ...   1
1   35      35.75   36.5    36.5    36.5    36.5    ...   2
2   25.5    25.5    25.5    25.5    25.5    25.5    ...   3
3   25.5    25.5    25.5    25.5    25.5    25.5    ...   4
4   25      25      25      25      25      25      ...   5
⋮    ⋮       ⋮        ⋮       ⋮       ⋮        ⋮             ⋮

Note that the table above is unrelated to the previous dataset; it only shows the format.

Thanks

1 Answer:

Answer 0 (score: 0)

Use DataFrame.groupby together with DataFrameGroupBy.resample:

def daily_average_temperature(dfdf):
    dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .ffill()
                .reset_index())

    return dfdf

print (daily_average_temperature(dfdf))
    Group_Id           Timestamp   Data
0         50 2018-01-03 00:00:15  125.5
1         52 2018-01-02 00:00:09  127.0
2         52 2018-01-02 00:00:10  127.0
3         52 2018-01-02 00:00:11  127.0
4         52 2018-01-02 00:00:12  127.0
5         52 2018-01-02 00:00:13  126.5
6        101 2018-01-01 00:00:05  125.0
7        120 2018-01-01 00:00:07  125.5
8        120 2018-01-01 00:00:08  125.0
9        300 2018-01-04 00:00:14  127.0
10       300 2018-01-04 00:00:15  126.5
11       350 2018-01-05 00:00:19  125.5
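
For reference, here is a minimal self-contained sketch of the same idea on a few rows copied from the question. It differs from the code above in two ways that are my own choices, not part of the original answer: the forward fill is applied per group with a second groupby on the Group_Id index level, so a fill can never carry over from one group into the next, and the frequency is spelled '1s', which recent pandas versions prefer over '1S'.

```python
import pandas as pd

# A few rows copied from the question (only the columns that are needed).
dfdf = pd.DataFrame({
    "Timestamp": [
        "2018-01-01 00:00:05.523",
        "2018-01-01 00:00:05.757",
        "2018-01-02 00:00:09.507",
        "2018-01-02 00:00:13.743",
        "2018-01-03 00:00:15.407",
    ],
    "Data": [125.5, 125.0, 127.0, 126.5, 125.5],
    "Group_Id": [101, 101, 52, 52, 50],
})
dfdf["Timestamp"] = pd.to_datetime(dfdf["Timestamp"])

out = (
    dfdf.set_index("Timestamp")
        .groupby("Group_Id")["Data"]
        .resample("1s")            # 1-second bins inside each group
        .last()                    # last observation in each bin, NaN for empty bins
        .groupby(level="Group_Id")
        .ffill()                   # forward-fill only within the same Group_Id
        .reset_index()
)
print(out)
```

Each group is resampled only between its own first and last timestamps, so groups with a single observation keep a single row.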

EDIT: This solution uses the minimal and maximal datetimes, built with date_range, in DataFrame.reindex on the DatetimeIndex in the columns after reshaping with Series.unstack, and adds back filling if necessary:

def daily_average_temperature(dfdf):
    dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
    #remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range(s.min(),s.max(), freq='S'), axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
                )
    return dfdf

df = daily_average_temperature(dfdf)

print (df['Group_Id'].value_counts())
350    345615
300    345615
120    345615
101    345615
52     345615
50     345615
Name: Group_Id, dtype: int64

Another solution is similar, only the values for date_range are specified as strings (not dynamically by min and max):

def daily_average_temperature(dfdf):
    dfdf['Timestamp']=pd.to_datetime(dfdf['Timestamp'])
    #remove ms for minimal and maximal seconds in data
    s = dfdf['Timestamp'].dt.floor('S')
    dfdf = (dfdf.set_index('Timestamp')
                .groupby('Group_Id')['Data']
                .resample('1S')
                .last()
                .unstack()
                .reindex(pd.date_range('2018-01-01','2018-01-08', freq='S'), axis=1, method='ffill')
                .rename_axis('Timestamp', axis=1)
                .bfill(axis=1)
                .ffill(axis=1)
                .stack()
                .reset_index(name='Data')
                )
    return dfdf

df = daily_average_temperature(dfdf)

print (df['Group_Id'].value_counts())
350    604801
300    604801
120    604801
101    604801
52     604801
50     604801
Name: Group_Id, dtype: int64
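
The second part of the question (one row per Group_Id, with columns x0, x1, ...) is not addressed by the answer. One possible sketch, assuming the columns are meant to be positions within each group rather than absolute timestamps, is to number the resampled rows inside each group with cumcount and pivot on that number. The names long_df, step and wide are hypothetical, and long_df is typed in by hand to stand for a per-group resampled result in long format.

```python
import pandas as pd

# Stand-in for a resampled, forward-filled result in long format
# (values typed in by hand for illustration).
long_df = pd.DataFrame({
    "Group_Id": [50, 52, 52, 52, 52, 52, 101],
    "Timestamp": pd.to_datetime([
        "2018-01-03 00:00:15",
        "2018-01-02 00:00:09",
        "2018-01-02 00:00:10",
        "2018-01-02 00:00:11",
        "2018-01-02 00:00:12",
        "2018-01-02 00:00:13",
        "2018-01-01 00:00:05",
    ]),
    "Data": [125.5, 127.0, 127.0, 127.0, 127.0, 126.5, 125.0],
})

# Position of each sample inside its own group: 0, 1, 2, ...
step = long_df.groupby("Group_Id").cumcount()

wide = (
    long_df.assign(step=step)
           .pivot(index="Group_Id", columns="step", values="Data")
           .add_prefix("x")              # column labels 0, 1, ... become x0, x1, ...
           .rename_axis(columns=None)    # drop the leftover "step" axis name
           .reset_index()
)
print(wide)
```

Groups with fewer samples than the longest one are padded with NaN on the right. If every group must cover the same time span instead, reindexing against a common date_range as in the edited answer would be the closer fit.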