My dataframe looks like this:
timestamp battery_state battery_level
0 2017-10-08 13:42:02 Charging 0.94
1 2017-10-08 13:45:43 Charging 0.95
2 2017-10-08 13:49:08 Charging 0.96
3 2017-10-08 13:54:07 Charging 0.97
4 2017-10-08 13:57:26 Charging 0.98
5 2017-10-08 14:01:35 Charging 0.99
6 2017-10-08 14:03:03 Full 1.00
7 2017-10-08 14:17:19 Charging 0.98
8 2017-10-08 14:26:05 Charging 0.97
9 2017-10-08 14:46:10 Charging 0.98
10 2017-10-08 14:47:47 Full 1.00
11 2017-10-08 16:36:24 Charging 0.91
12 2017-10-08 16:40:32 Charging 0.92
13 2017-10-08 16:47:58 Charging 0.93
14 2017-10-08 16:51:51 Charging 0.94
15 2017-10-08 16:55:26 Charging 0.95
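As you can see in this dataframe, there are 3 subsets of samples that correspond to device charging periods.
Note: a charging period does not always end in the Full state, e.g. samples 11 to 15.
The goal is to split these 3 periods into variables and process each one as soon as it is found.
To do this, I wrote the following code: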
from itertools import islice

previous_index = 0  # stores the initial index of each period
# note: get_value was removed in pandas 1.0; .at[index, 'battery_level'] is the current equivalent
for index in islice(device_charge_samples.index, 1, None):  # use islice because the first row has no previous sample to compare
    # creates a period by comparing each line two by two
    if device_charge_samples.get_value(index, 'battery_level') < device_charge_samples.get_value(index - 1, 'battery_level'):
        subset = device_charge_samples[previous_index:index].reset_index(drop=True)
        # Process subset function here
        previous_index = index
    # last period case
    if index == len(device_charge_samples) - 1:
        subset = device_charge_samples[previous_index:index + 1].reset_index(drop=True)
        # Process subset function here
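I have already replaced device_charge_samples.iterrows() with device_charge_samples.index in the for loop, and replaced device_charge_samples.loc[index, 'battery_level'] with device_charge_samples.get_value(index, 'battery_level'); both changes helped a lot.
Are there any other optimizations I can make? For example, using the dataframe apply function (it seems to act as a loop over each row, but I don't know how to use it in this case, or even whether it is worth using), or any other optimization that would fit my solution.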
Answer 0: (score: 2)
First, create a group column using cumsum:
df['group'] = (df.battery_state == 'Full').cumsum().shift(1).fillna(0)
Now you can iterate over the groups instead of iterating over the rows:
for index, frame in df.groupby('group'):
    subsetFunction(frame)
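To see what the group column looks like, here is a minimal sketch on a small made-up frame (the six rows are hypothetical and only mirror the shape of the question's data). It suggests that each 'Full' row stays in the period it ends, because shift(1) delays the counter by one row.

```python
import pandas as pd

# Small hypothetical frame mirroring the shape of the question's data.
df = pd.DataFrame({
    'battery_state': ['Charging', 'Charging', 'Full', 'Charging', 'Full', 'Charging'],
    'battery_level': [0.98, 0.99, 1.00, 0.98, 1.00, 0.91],
})

# cumsum() counts the 'Full' rows seen so far; shift(1) moves that count down
# one row so each 'Full' row keeps the group number of the period it closes.
df['group'] = (df.battery_state == 'Full').cumsum().shift(1).fillna(0)

group_labels = df['group'].tolist()
group_sizes = df.groupby('group').size().tolist()
print(df)
print(group_sizes)
```

Note that the group labels come out as floats (0.0, 1.0, 2.0) because shift(1) introduces a NaN before fillna(0) replaces it.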
Answer 1: (score: 1)
You can use np.split() to split on the rows where battery_state == 'Full' and drop those rows:
m = df['battery_state'] == 'Full'
for subset in np.split(df[~m], df.index[m] - np.arange(sum(m))):
    # 1000 loops, best of 3: 783 µs per loop
    # do something with subset here
    pass
Or, as DJK suggests, use a cumsum (here in a more compact version, timed for a fair comparison):
m = df.battery_state == 'Full'
for idx, subset in df[~m].groupby(m.cumsum()):
    # 1000 loops, best of 3: 999 µs per loop
    # do something with subset here
    pass
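The split points passed to np.split() above, df.index[m] - np.arange(sum(m)), can look cryptic. The sketch below works through that arithmetic on plain NumPy arrays, assuming a default 0..15 integer index and 'Full' rows at positions 6 and 10 as in the sample data; the variable names are illustrative only.

```python
import numpy as np

# Positions of the 'Full' rows in the sample data (assumes a default integer index).
full_positions = np.array([6, 10])

# Once the 'Full' rows are dropped, every later row moves up by the number of
# 'Full' rows before it, so the k-th split point shifts left by k.
split_points = full_positions - np.arange(len(full_positions))

# Row labels that remain after dropping rows 6 and 10 from a 16-row frame.
remaining = np.delete(np.arange(16), full_positions)
pieces = [piece.tolist() for piece in np.split(remaining, split_points)]
print(split_points.tolist())
print(pieces)
```

If the frame's index is not a plain 0..n-1 range, this offset trick would presumably need positional indices rather than index labels.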
Full example:
import io

import numpy as np
import pandas as pd

data = '''\
timestamp battery_state battery_level
2017-10-08T13:42:02 Charging 0.94
2017-10-08T13:45:43 Charging 0.95
2017-10-08T13:49:08 Charging 0.96
2017-10-08T13:54:07 Charging 0.97
2017-10-08T13:57:26 Charging 0.98
2017-10-08T14:01:35 Charging 0.99
2017-10-08T14:03:03 Full 1.00
2017-10-08T14:17:19 Charging 0.98
2017-10-08T14:26:05 Charging 0.97
2017-10-08T14:46:10 Charging 0.98
2017-10-08T14:47:47 Full 1.00
2017-10-08T16:36:24 Charging 0.91
2017-10-08T16:40:32 Charging 0.92
2017-10-08T16:47:58 Charging 0.93
2017-10-08T16:51:51 Charging 0.94
2017-10-08T16:55:26 Charging 0.95'''

# io.StringIO replaces pd.compat.StringIO, which newer pandas versions no longer provide
df = pd.read_csv(io.StringIO(data), sep=r'\s+', parse_dates=['timestamp'])

m = df['battery_state'] == 'Full'
for subset in np.split(df[~m], df.index[m] - np.arange(sum(m))):
    print(subset)
Returns:
timestamp battery_state battery_level
0 2017-10-08 13:42:02 Charging 0.94
1 2017-10-08 13:45:43 Charging 0.95
2 2017-10-08 13:49:08 Charging 0.96
3 2017-10-08 13:54:07 Charging 0.97
4 2017-10-08 13:57:26 Charging 0.98
5 2017-10-08 14:01:35 Charging 0.99
timestamp battery_state battery_level
7 2017-10-08 14:17:19 Charging 0.98
8 2017-10-08 14:26:05 Charging 0.97
9 2017-10-08 14:46:10 Charging 0.98
timestamp battery_state battery_level
11 2017-10-08 16:36:24 Charging 0.91
12 2017-10-08 16:40:32 Charging 0.92
13 2017-10-08 16:47:58 Charging 0.93
14 2017-10-08 16:51:51 Charging 0.94
15 2017-10-08 16:55:26 Charging 0.95
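Both answers split on battery_state == 'Full', whereas the loop in the question starts a new period whenever battery_level drops. As a rough sketch of a vectorized form of that rule (assuming the same battery_level column; period_id and period_sizes are illustrative names), diff() and cumsum() can build the group key directly:

```python
import pandas as pd

# battery_level values copied from the sample data in the question.
levels = [0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.00, 0.98,
          0.97, 0.98, 1.00, 0.91, 0.92, 0.93, 0.94, 0.95]
df = pd.DataFrame({'battery_level': levels})

# A new period starts wherever the level is lower than in the previous row;
# the first row's diff is NaN, which compares as False.
period_id = (df['battery_level'].diff() < 0).cumsum().rename('period')

period_sizes = [len(subset) for _, subset in df.groupby(period_id)]
print(period_sizes)
```

On the sample data this rule produces four groups rather than three, because the level also drops from 0.98 to 0.97 between rows 7 and 8. Splitting on the 'Full' state, as the answers do, avoids that extra break.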