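Grouping by time interval in a pandas DataFrame

Time: 2017-09-18 09:14:44

Tags: python pandas dataframe

I have a pandas DataFrame imported with 2 columns (Time, HR bpm).

Time is in MM:SS.s format (minutes:seconds.milliseconds). I am trying to convert it to a float number of seconds (e.g. 0.6 s or 65.3 s) so that I can later collapse the data into 10 s windows. For example: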

import pandas as pd
hr_raw = pd.read_csv('hr_data.csv')
hr_raw.dropna(inplace=True)
print(hr_raw.head())

   Time       HR bpm
0  00:00.6    97.0
1  00:01.0    92.0
2  00:01.3    80.0
3  00:01.6    81.0
4  00:02.0    80.0

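Previously, when importing with the standard csv module, I simply sliced this string, converted the pieces to floats, and did the arithmetic to get seconds: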

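import csv

time = []  # elapsed time in seconds, one entry per row
with open('hr_data.csv', newline='') as infile:
    hr_data = list(csv.DictReader(infile, delimiter=','))
    for row in hr_data:
        temp = row['Time']  # e.g. '01:05.3'
        # seconds part plus minutes part converted to seconds
        time.append(float(temp[3:7]) + (float(temp[0:2]) * 60))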

Now I am using pandas, but the code no longer works. I tried to modify it so that it accesses the 'Time' column (see below), but without much luck.

import pandas as pd

win_size = 10  # user defined window in seconds

hr_raw = pd.read_csv('hr_data.csv')
hr_raw.dropna(inplace=True) #remove NaN artifact from import

#### problem code ####
for row in hr_raw.Time:
    hr_raw.Time[row] = float(hr_raw.Time[row][3:]) + float((hr_raw.Time[row][0:2] * 60))

# set time as index
hr_raw.set_index('Time', inplace=True)

# bin data based on user defined window
hr_bin = hr_raw.groupby((hr_raw.index // win_size + 1) * win_size).mean()

The error I get is:

Traceback (most recent call last):
  File "pandas\_libs\index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5126)
  File "pandas\_libs\hashtable_class_helper.pxi", line 759, in pandas._libs.hashtable.Int64HashTable.get_item (pandas\_libs\hashtable.c:14010)
TypeError: an integer is required

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\mitbl001\Dropbox\CPET_python\import_hr_csv.py", line 11, in <module>
    hr_raw.Time[row] = float(hr_raw.Time[row][3:]) + float((hr_raw.Time[row][0:2] * 60))
  File "C:\Users\mitbl001\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\series.py", line 601, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\mitbl001\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pandas\core\indexes\base.py", line 2477, in get_value
    tz=getattr(series.dtype, 'tz', None))
  File "pandas\_libs\index.pyx", line 98, in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4404)
  File "pandas\_libs\index.pyx", line 106, in pandas._libs.index.IndexEngine.get_value (pandas\_libs\index.c:4087)
  File "pandas\_libs\index.pyx", line 156, in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5210)
KeyError: '00:00.6'

2 answers:

Answer 0 (score: 1)

I think you need indexing with str, followed by astype to cast to float:

hr_raw.Time = hr_raw.Time.str[3:].astype(float) + hr_raw.Time.str[0:2].astype(float) * 60
print (hr_raw)

   Time  HR bpm
0   0.6    97.0
1   1.0    92.0
2   1.3    80.0
3   1.6    81.0
4   2.0    80.0
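
As background on why the loop in the question fails, here is a minimal sketch on a small made-up Series (the sample values are assumptions, not the asker's file). Iterating over a Series yields its values rather than its index labels, so each `row` is a string that is then used as a lookup key, and multiplying a string by 60 repeats it instead of doing arithmetic.

```python
import pandas as pd

# Made-up sample in the same MM:SS.s format as the question
s = pd.Series(['00:00.6', '00:01.0', '01:05.3'], name='Time')

# Iterating over a Series yields values, so this is the string '00:00.6'
first = next(iter(s))

# Using that string as a key on the default integer index raises KeyError
try:
    s[first]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)

# str * int repeats the string; it does not multiply a number
repeated = '01' * 60
print(len(repeated))  # 120 characters

# Vectorised conversion: slice every element, cast, then do the arithmetic
seconds = s.str[3:].astype(float) + s.str[:2].astype(float) * 60
print(seconds.tolist())
```

The vectorised expression on the last lines avoids both problems because the cast to float happens before the multiplication.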

Another solution is to convert with to_timedelta, but first prepend the hours part from the right-hand side with radd:

hr_raw.Time = pd.to_timedelta(hr_raw.Time.radd('00:')).dt.total_seconds()
print (hr_raw)

   Time  HR bpm
0   0.6    97.0
1   1.0    92.0
2   1.3    80.0
3   1.6    81.0
4   2.0    80.0
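
To check that the two conversions agree once the minutes part is non-zero, here is a small self-contained sketch. The sample rows are made up for illustration, and `'00:' + hr_raw.Time` is used as an equivalent spelling of `radd('00:')`.

```python
import pandas as pd

# Made-up rows that include non-zero minutes
hr_raw = pd.DataFrame({
    'Time': ['00:00.6', '00:59.9', '01:05.3', '12:30.0'],
    'HR bpm': [97.0, 92.0, 80.0, 81.0],
})

# Approach 1: slice the string, cast to float, combine minutes and seconds
by_slicing = (hr_raw.Time.str[3:].astype(float)
              + hr_raw.Time.str[:2].astype(float) * 60)

# Approach 2: prepend an hours field so the string reads HH:MM:SS.s,
# parse it as a timedelta, then take the total number of seconds
by_timedelta = pd.to_timedelta('00:' + hr_raw.Time).dt.total_seconds()

print(pd.DataFrame({'slicing': by_slicing, 'timedelta': by_timedelta}))
```

The slicing version assumes fixed-width two-digit minutes; the timedelta version leaves the parsing to pandas.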

Then set_index is not needed; use the Time column directly:

# bin data based on user defined window
hr_bin = hr_raw.groupby((hr_raw.Time // win_size + 1) * win_size).mean()
print (hr_bin)
      Time  HR bpm
Time              
10.0   1.3    86.0
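
The five sample rows all fall into the first window, so only one group appears. As a rough illustration with made-up data spanning three windows (the values are assumptions), the same expression labels each group by the end of its window; selecting 'HR bpm' avoids averaging the Time column as well.

```python
import pandas as pd

win_size = 10  # window length in seconds

# Made-up readings covering 0-30 s
hr = pd.DataFrame({
    'Time': [0.6, 4.0, 9.9, 10.0, 15.5, 21.0, 29.9],
    'HR bpm': [97.0, 92.0, 81.0, 80.0, 84.0, 90.0, 94.0],
})

# Floor-divide to get the window number, add 1, and scale back:
# every sample in [0, 10) gets label 10, [10, 20) gets 20, and so on
window_end = (hr['Time'] // win_size + 1) * win_size

hr_bin = hr.groupby(window_end)['HR bpm'].mean()
print(hr_bin)
```

Note that a sample at exactly 10.0 s lands in the window labelled 20, because each window includes its start and excludes its end.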

Answer 1 (score: 1)

Use pd.to_timedelta

Convert the Time column to float:
df['Time'] = pd.to_timedelta('00:' + df.Time).dt.total_seconds()
df

   Time  HR bpm
0   0.6    97.0
1   1.0    92.0
2   1.3    80.0
3   1.6    81.0
4   2.0    80.0

The groupby should now be simple, using the syntax:

df.groupby(df.Time // x * x)

where x is your desired time window. Here is an example that groups in 0.5-second intervals and takes the mean heart rate:

df.groupby(df.Time // 0.5 * 0.5)['HR bpm'].mean()

Time
0.5    97.0
1.0    86.0
1.5    81.0
2.0    80.0
Name: HR bpm, dtype: float64

The above returns a Series. If you want a DataFrame, call reset_index after the groupby:

df.groupby(df.Time // 0.5 * 0.5)['HR bpm'].mean().reset_index()

   Time  HR bpm
0   0.5    97.0
1   1.0    86.0
2   1.5    81.0
3   2.0    80.0

In your case, you could do something along the lines of df.groupby(df.Time // 10 * 10).
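
A minimal sketch of that 10-second case, on made-up data (the values are assumptions). With `// x * x` each group is labelled by the start of its window; adding x to the key would label it by the end instead.

```python
import pandas as pd

x = 10  # window length in seconds

# Made-up readings, including samples exactly on a window boundary
df = pd.DataFrame({
    'Time': [0.6, 9.9, 10.0, 19.9, 20.0],
    'HR bpm': [97.0, 91.0, 80.0, 84.0, 90.0],
})

# Lower edge of the window each sample belongs to: 0, 0, 10, 10, 20
window_start = df['Time'] // x * x

# Mean heart rate per window, returned as a DataFrame
result = df.groupby(window_start)['HR bpm'].mean().reset_index()
print(result)

# Same grouping, labelled by the upper edge of each window
window_end = window_start + x
```

Which label to use is a matter of convention; the grouping itself is identical either way.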