A method for normalizing the number of samples (rows)

Time: 2019-03-27 23:21:04

Tags: python python-3.x

I am looking for a Python-based solution that groups the data by Id and computes the mean of the data within different time ranges.

Input Data

Id  Time    X1  Y1  X2  Y2  X3  Y3
A   0.08    427 351 427 351 427 353
A   0.15    384 365 384 365 384 367
A   0.24    125 190 196 404 196 406
A   0.39    468 342 468 342 398 375
A   0.47    171 457 171 457 171 460
A   0.53    1   343 1   343 1   345
A   0.66    139 328 139 328 139 330
B   0.04    152 179 152 181 150 183
B   0.19    74  75  123 400 123 404
B   0.26    117 99  117 104 116 105
B   0.39    156 125 156 131 71  209
B   0.47    187 147 189 155 187 157
B   0.03    272 340 278 361 249 442
B   0.14    272 351 275 354 250 420
C   0.26    279 347 279 347 266 384
C   0.37    271 337 283 348 258 377


Group by Id and compute the mean of X1, Y1, X2, Y2, X3, Y3 within ranges determined by the frame (Time).

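For each grouped Id, the mean of all X, Y values is computed over the frames that fall in the following ranges. If a range contains no X, Y values, return NaN.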

1 = (Time <= .1)
2 = (.1 <= Time <= .2)
3 = (.2 <= Time <= .3)
4 = (.3 <= Time <= .4)
5 = (.4 <= Time <= .5)
6 = (.5 <= Time <= .6)
7 = (.6 <= Time <= .7)
8 = (.7 <= Time <= .8)
9 = (.8 <= Time <= .9)
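
The ranges above overlap at their endpoints (0.1 belongs to both range 1 and range 2 as written), so any implementation has to pick a side. As a minimal sketch, assuming right-closed bins and non-negative times, `pd.cut` can assign the range numbers 1 to 9. The sample values in `times` are hypothetical and chosen only to show how boundary values are handled.

```python
import pandas as pd

# Hypothetical sample times, including values that sit exactly on a boundary.
times = pd.Series([0.0, 0.08, 0.1, 0.15, 0.2, 0.9])

# Bin edges 0.0, 0.1, ..., 0.9 give nine bins, labelled 1..9.
edges = [i / 10 for i in range(10)]

# Assumption: bins are right-closed, so 0.1 falls in bin 1 and 0.2 in bin 2.
# include_lowest=True lets bin 1 also contain 0.0.
bin_no = pd.cut(times, edges, labels=list(range(1, 10)), include_lowest=True)
print(bin_no.tolist())
```

Values above 0.9 would fall outside every bin and come back as NaN, which matches the list of ranges stopping at 9.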


Id  1X1 1Y1 1X2 1Y2 1X3 1Y3  ... 9X3    9Y3  
A   427 351 427 351 427 353
A   384 365 384 365 384 367
A   125 190 196 404 196 406
A   468 342 468 342 398 375
A   171 457 171 457 171 460
A   1   343 1   343 1   345
A   139 328 139 328 139 330
B   152 179 152 181 150 183
B   74  75  123 400 123 404
B   117 99  117 104 116 105
B   156 125 156 131 71  209
B   187 147 189 155 187 157
B   272 340 278 361 249 442
B   272 351 275 354 250 420
C   279 347 279 347 266 384
C   271 337 283 348 258 377

1 answer:

Answer 0 (score: 0)

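I think there is a misunderstanding in your expected output. The numbers you show suggest that you are pivoting the time axis along the rows, as in the steps below. At the same time, however, the column names suggest that you also want to pivot the bin dimension along the columns for each of the X, Y variables, although you did not provide those numbers.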

The following steps produce an output with the time bins in the rows.

import pandas as pd
import numpy as np

>>>df
   Id  Time   X1   Y1   X2   Y2   X3   Y3
0   A  0.08  427  351  427  351  427  353
1   A  0.15  384  365  384  365  384  367
2   A  0.24  125  190  196  404  196  406
3   A  0.39  468  342  468  342  398  375
4   A  0.47  171  457  171  457  171  460
5   A  0.53    1  343    1  343    1  345
6   A  0.66  139  328  139  328  139  330
7   B  0.04  152  179  152  181  150  183
8   B  0.19   74   75  123  400  123  404
9   B  0.26  117   99  117  104  116  105
10  B  0.39  156  125  156  131   71  209
11  B  0.47  187  147  189  155  187  157
12  B  0.03  272  340  278  361  249  442
13  B  0.14  272  351  275  354  250  420
14  C  0.26  279  347  279  347  266  384
15  C  0.37  271  337  283  348  258  377

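# This is the base operation that you're looking for to produce the output in your example.
# observed=False keeps the empty bins as NaN rows, which the TimeNo step below relies on.
df = df.groupby(['Id', pd.cut(df['Time'], np.arange(0, 1.0, 0.1))], observed=False).mean()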
>>>df
                Time     X1     Y1     X2     Y2     X3     Y3
Id Time
A  (0.0, 0.1]  0.080  427.0  351.0  427.0  351.0  427.0  353.0
   (0.1, 0.2]  0.150  384.0  365.0  384.0  365.0  384.0  367.0
   (0.2, 0.3]  0.240  125.0  190.0  196.0  404.0  196.0  406.0
   (0.3, 0.4]  0.390  468.0  342.0  468.0  342.0  398.0  375.0
   (0.4, 0.5]  0.470  171.0  457.0  171.0  457.0  171.0  460.0
   (0.5, 0.6]  0.530    1.0  343.0    1.0  343.0    1.0  345.0
   (0.6, 0.7]  0.660  139.0  328.0  139.0  328.0  139.0  330.0
   (0.7, 0.8]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.8, 0.9]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
B  (0.0, 0.1]  0.035  212.0  259.5  215.0  271.0  199.5  312.5
   (0.1, 0.2]  0.165  173.0  213.0  199.0  377.0  186.5  412.0
   (0.2, 0.3]  0.260  117.0   99.0  117.0  104.0  116.0  105.0
   (0.3, 0.4]  0.390  156.0  125.0  156.0  131.0   71.0  209.0
   (0.4, 0.5]  0.470  187.0  147.0  189.0  155.0  187.0  157.0
   (0.5, 0.6]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.6, 0.7]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.7, 0.8]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.8, 0.9]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
C  (0.0, 0.1]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.1, 0.2]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.2, 0.3]  0.260  279.0  347.0  279.0  347.0  266.0  384.0
   (0.3, 0.4]  0.370  271.0  337.0  283.0  348.0  258.0  377.0
   (0.4, 0.5]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.5, 0.6]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.6, 0.7]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.7, 0.8]    NaN    NaN    NaN    NaN    NaN    NaN    NaN
   (0.8, 0.9]    NaN    NaN    NaN    NaN    NaN    NaN    NaN

"""
The rest are just cosmetics
"""
# Drop the original Time column
df.drop('Time', axis=1, inplace=True)
# Reset the index
df.reset_index(inplace=True)
# Add a numerical label for the Time bins
df['TimeNo'] = (df.index % 9) + 1
# Rearrange the columns
df = df.iloc[:,[0,1,8]].join(df.iloc[:,2:8])
# Drop the rows in which every value column is NaN
df = df.dropna(how='all', subset=df.columns[3:])

>>>df
   Id        Time  TimeNo     X1     Y1     X2     Y2     X3     Y3
0   A  (0.0, 0.1]       1  427.0  351.0  427.0  351.0  427.0  353.0
1   A  (0.1, 0.2]       2  384.0  365.0  384.0  365.0  384.0  367.0
2   A  (0.2, 0.3]       3  125.0  190.0  196.0  404.0  196.0  406.0
3   A  (0.3, 0.4]       4  468.0  342.0  468.0  342.0  398.0  375.0
4   A  (0.4, 0.5]       5  171.0  457.0  171.0  457.0  171.0  460.0
5   A  (0.5, 0.6]       6    1.0  343.0    1.0  343.0    1.0  345.0
6   A  (0.6, 0.7]       7  139.0  328.0  139.0  328.0  139.0  330.0
9   B  (0.0, 0.1]       1  212.0  259.5  215.0  271.0  199.5  312.5
10  B  (0.1, 0.2]       2  173.0  213.0  199.0  377.0  186.5  412.0
11  B  (0.2, 0.3]       3  117.0   99.0  117.0  104.0  116.0  105.0
12  B  (0.3, 0.4]       4  156.0  125.0  156.0  131.0   71.0  209.0
13  B  (0.4, 0.5]       5  187.0  147.0  189.0  155.0  187.0  157.0
20  C  (0.2, 0.3]       3  279.0  347.0  279.0  347.0  266.0  384.0
21  C  (0.3, 0.4]       4  271.0  337.0  283.0  348.0  258.0  377.0

As you can see, with this output format you do not need to put the time bins in the columns.
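
If the wide layout hinted at by the question's column names (1X1 ... 9Y3) is still wanted, one possible sketch is to unstack the bin level after the groupby. This is not part of the steps above. It rebuilds the frame from the question's input table, and the names `raw`, `time_no`, `means`, `wide`, and `ordered` are illustrative assumptions.

```python
import io

import pandas as pd

# The input data from the question, parsed from whitespace-separated text.
raw = """Id Time X1 Y1 X2 Y2 X3 Y3
A 0.08 427 351 427 351 427 353
A 0.15 384 365 384 365 384 367
A 0.24 125 190 196 404 196 406
A 0.39 468 342 468 342 398 375
A 0.47 171 457 171 457 171 460
A 0.53 1 343 1 343 1 345
A 0.66 139 328 139 328 139 330
B 0.04 152 179 152 181 150 183
B 0.19 74 75 123 400 123 404
B 0.26 117 99 117 104 116 105
B 0.39 156 125 156 131 71 209
B 0.47 187 147 189 155 187 157
B 0.03 272 340 278 361 249 442
B 0.14 272 351 275 354 250 420
C 0.26 279 347 279 347 266 384
C 0.37 271 337 283 348 258 377
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Label the nine right-closed bins (0.0, 0.1] ... (0.8, 0.9] as 1..9.
edges = [i / 10 for i in range(10)]
time_no = pd.cut(df["Time"], edges, labels=list(range(1, 10))).rename("TimeNo")

# Mean per (Id, bin); observed=False keeps empty bins so they show up as NaN.
means = df.drop(columns="Time").groupby(["Id", time_no], observed=False).mean()

# Move the bin level into the columns: one row per Id, one column per (variable, bin).
wide = means.unstack("TimeNo")

# Flatten the (variable, bin) column pairs into names such as "1X1" and order them bin by bin.
wide.columns = [f"{bin_no}{var}" for var, bin_no in wide.columns]
value_cols = ["X1", "Y1", "X2", "Y2", "X3", "Y3"]
ordered = [f"{b}{v}" for b in range(1, 10) for v in value_cols]
wide = wide[ordered]

print(wide.iloc[:, :6])
```

The result has one row per Id and 54 columns, with NaN wherever an Id has no samples in a bin. Whether this is preferable to the long format above depends on how the table is used downstream.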