Extracting part of the values from a DataFrame

Time: 2017-09-06 13:23:30

Tags: python pandas dataframe

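I want to write a function that selects part of a DataFrame as follows: given "first_non_zero_index" (in our example, the index of a row) and a value threshold (e.g. 4) as input, it returns an index, called "last_non_zero_index", such that df.loc[first_non_zero_index:last_non_zero_index] yields the output shown below. In addition, at most 10 consecutive zeros are allowed between two non-zero values.

I would really appreciate your help. Thanks in advance. Carlo

Input DataFrame: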

id, ts,value,
id1,2017-04-27 01:35:30,0
id1,2017-04-27 01:36:30,0
id1,2017-04-27 01:37:00,0
id1,2017-04-27 01:38:00,0
id1,2017-04-27 01:39:00,0
id1,2017-04-27 01:40:00,0
id1,2017-04-27 01:41:00,0
id1,2017-04-27 01:42:00,0
id1,2017-04-27 01:43:00,0
id1,2017-04-27 01:44:00,0
id1,2017-04-27 01:45:30,4.0
id1,2017-04-27 01:46:00,99.0
id1,2017-04-27 01:47:30,100.0
id1,2017-04-27 01:48:30,100.0
id1,2017-04-27 01:49:30,100.0
id1,2017-04-27 01:50:30,100.0
id1,2017-04-27 01:51:30,100.0
id1,2017-04-27 01:52:00,100.0
id1,2017-04-27 01:53:00,0
id1,2017-04-27 01:54:00,0
id1,2017-04-27 02:55:30,5.0
id1,2017-04-27 02:56:00,6.0  
id1,2017-04-27 02:57:30,7.0 
id1,2017-04-27 02:58:00,8.0
id1,2017-04-27 02:59:30,4.0
id1,2017-04-27 02:00:30,0
id1,2017-04-27 02:01:30,0
id1,2017-04-27 02:02:00,0
id1,2017-04-27 02:03:00,0
id1,2017-04-27 02:04:00,0
id1,2017-04-27 02:05:00,0
id1,2017-04-27 02:06:00,0
id1,2017-04-27 02:07:00,0
id1,2017-04-27 02:08:00,0
id1,2017-04-27 02:09:00,0
id1,2017-04-27 02:10:00,0
id1,2017-04-27 02:11:00,0
id1,2017-04-27 02:12:30,4.0
id1,2017-04-27 02:13:00,99.0
id1,2017-04-27 02:14:30,1000.0
id1,2017-04-27 02:15:30,1000.0
id1,2017-04-27 02:16:30,1000.0
id1,2017-04-27 02:17:30,1000.0
id1,2017-04-27 02:18:30,1000.0
id1,2017-04-27 01:19:00,1000.0
id1,2017-04-27 02:20:00,0
id1,2017-04-27 02:20:00,0
id1,2017-04-27 02:21:00,0
id1,2017-04-27 02:22:30,5.0
id1,2017-04-27 02:23:00,6.0  
id1,2017-04-27 02:24:30,7.0 
id1,2017-04-27 02:25:00,8.0
id1,2017-04-27 02:26:30,4.0
id1,2017-04-27 02:27:30,0
id1,2017-04-27 02:28:00,0
id1,2017-04-27 02:29:00,0
id1,2017-04-27 02:30:00,0
id1,2017-04-27 02:31:00,0
id1,2017-04-27 02:32:00,0
id1,2017-04-27 02:33:00,0
id1,2017-04-27 02:34:00,0
id1,2017-04-27 02:35:00,0
id1,2017-04-27 02:36:00,0
id1,2017-04-27 02:37:00,0

Output DataFrame:

id, ts,value,
id1,2017-04-27 01:45:30,4.0
id1,2017-04-27 01:46:00,99.0
id1,2017-04-27 01:47:30,100.0
id1,2017-04-27 01:48:30,100.0
id1,2017-04-27 01:49:30,100.0
id1,2017-04-27 01:50:30,100.0
id1,2017-04-27 01:51:30,100.0
id1,2017-04-27 01:52:00,100.0
id1,2017-04-27 01:53:00,0
id1,2017-04-27 01:54:00,0
id1,2017-04-27 02:55:30,5.0
id1,2017-04-27 02:56:00,6.0  
id1,2017-04-27 02:57:30,7.0 
id1,2017-04-27 02:58:00,8.0
id1,2017-04-27 02:59:30,4.0

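2 answers:

Answer 0: (score: 2)

This should get you on the right track. It takes the input DataFrame and returns an output DataFrame that runs from the first element that passes the "threshold" to the last element that passes it. A second step then cuts the result before the first run of more than 10 consecutive zeros.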

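import pandas as pd

df = pd.read_csv('data.csv')

def extractPartialDataframe(df, threshold):
    # Row positions whose value passes the threshold (assumes the default RangeIndex)
    indicesList = df[df.value >= threshold].index.tolist()
    # Keep everything from the first such row to the last one, inclusive;
    # reset_index() returns a new frame and keeps the old labels in an 'index' column
    return df.iloc[min(indicesList): max(indicesList) + 1].reset_index()

trimmedDF = extractPartialDataframe(df, 4)
maxConsecutiveZeros = 10
# Label each run of equal consecutive values, then give every row the length of its run
runIds = (trimmedDF["value"] != trimmedDF["value"].shift(1)).cumsum()
consecutives = trimmedDF["value"].groupby(runIds).transform('count')

# Positions of the zeros that belong to a run longer than the allowed maximum
tooManyConsecutiveZeros = trimmedDF[(trimmedDF["value"] == 0) & (consecutives > maxConsecutiveZeros)].index.tolist()
# Cut just before the first over-long run; keep everything if there is none
final_df = trimmedDF.iloc[:tooManyConsecutiveZeros[0]] if tooManyConsecutiveZeros else trimmedDF
print(final_df)

Output: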

10  id1  2017-04-27 01:45:30    4.0
11  id1  2017-04-27 01:46:00   99.0
12  id1  2017-04-27 01:47:30  100.0
13  id1  2017-04-27 01:48:30  100.0
14  id1  2017-04-27 01:49:30  100.0
15  id1  2017-04-27 01:50:30  100.0
16  id1  2017-04-27 01:51:30  100.0
17  id1  2017-04-27 01:52:00  100.0
18  id1  2017-04-27 01:53:00    0.0
19  id1  2017-04-27 01:54:00    0.0
20  id1  2017-04-27 02:55:30    5.0
21  id1  2017-04-27 02:56:00    6.0
22  id1  2017-04-27 02:57:30    7.0
23  id1  2017-04-27 02:58:00    8.0
24  id1  2017-04-27 02:59:30    4.0
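
The question asks for a function that returns the last index rather than a trimmed frame. A minimal sketch of that, assuming "threshold" means value >= threshold as in the code above; the names find_last_index and max_zeros, and the small sample frame, are made up for illustration:

```python
import pandas as pd

def find_last_index(df, first_non_zero_index, threshold, max_zeros=10):
    """Return the label of the last row to keep, scanning from first_non_zero_index.

    Remembers the last row whose value is >= threshold and stops as soon as
    more than max_zeros consecutive zeros have been seen.
    """
    last_index = first_non_zero_index
    zero_run = 0
    for label, value in df.loc[first_non_zero_index:, "value"].items():
        if value == 0:
            zero_run += 1
            if zero_run > max_zeros:
                break  # the gap is too long, so the block ended earlier
        else:
            zero_run = 0
            if value >= threshold:
                last_index = label
    return last_index

# Hypothetical sample: a block, a gap of 12 zeros, then another block
sample = pd.DataFrame({"value": [0, 0, 4, 99, 0, 0, 5, 4] + [0] * 12 + [4, 1000]})
last = find_last_index(sample, first_non_zero_index=2, threshold=4)
selected = sample.loc[2:last]
```

Because it walks the rows in a plain Python loop, this is easy to follow but will be slow on very large frames.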

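Answer 1: (score: 0)

I think that by threshold you mean a delimiter (a threshold would imply a >= or <= comparison, whereas I guess you need == to get the desired output, since the values in your example fall both below and above 4).

Find the first non-zero value:

start = (df['value'] != 0).tolist().index(True)

Find the delimiters (4.0 in your example):

fours = (df['value'] == 4).tolist()

Slice from the first non-zero value to the first delimiter that comes after it:

df.iloc[start:fours.index(True, start+1)+1]

The output should look like your example: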

         id                   ts  value
    10  id1  2017-04-27 01:45:30    4.0
    11  id1  2017-04-27 01:46:00   99.0
    12  id1  2017-04-27 01:47:30  100.0
    13  id1  2017-04-27 01:48:30  100.0
    14  id1  2017-04-27 01:49:30  100.0
    15  id1  2017-04-27 01:50:30  100.0
    16  id1  2017-04-27 01:51:30  100.0
    17  id1  2017-04-27 01:52:00  100.0
    18  id1  2017-04-27 01:53:00    0.0
    19  id1  2017-04-27 01:54:00    0.0
    20  id1  2017-04-27 02:55:30    5.0
    21  id1  2017-04-27 02:56:00    6.0
    22  id1  2017-04-27 02:57:30    7.0
    23  id1  2017-04-27 02:58:00    8.0
    24  id1  2017-04-27 02:59:30    4.0
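
The three steps above can be wrapped in one function. This is only a sketch: the name slice_to_delimiter and the fallbacks for "no non-zero value" and "no later delimiter" are assumptions, since list.index raises ValueError in those cases and the steps above do not say what should happen then.

```python
import pandas as pd

def slice_to_delimiter(df, delimiter):
    """Slice from the first non-zero value to the next row equal to delimiter."""
    values = df["value"].tolist()
    nonzero = [v != 0 for v in values]
    if True not in nonzero:
        return df.iloc[0:0]  # assumption: an all-zero column gives an empty frame
    start = nonzero.index(True)
    hits = [v == delimiter for v in values]
    try:
        # Search strictly after start, so a delimiter at start itself is skipped
        end = hits.index(True, start + 1)
    except ValueError:
        return df.iloc[start:]  # assumption: no later delimiter keeps the rest
    return df.iloc[start:end + 1]

# Hypothetical sample data
sample = pd.DataFrame({"value": [0, 0, 4, 99, 100, 0, 0, 5, 4, 0, 4]})
result = slice_to_delimiter(sample, 4)
```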

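[Update]

I don't know whether there is a pandas equivalent of this list comprehension, but maybe it can inspire you:

valid = [df['value'][i:i+10].sum() >= 4 for i in range(len(df))]

# stop at the first window of 10 values whose sum falls below 4
df.iloc[start:valid.index(False, start+1)]

The above is not exactly what you asked for: it stops at the first position where the next 10 consecutive values sum to less than 4. Strictly speaking, what you asked for is more like this:

valid = [bool(df['value'][i:i+10].sum()) or value >= 4
         for i, value in enumerate(df['value'])]

If performance isn't too bad, using this is probably better than banging your head against the wall looking for a "pure pandas" approach.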
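
For what it's worth, one possible pandas counterpart of the windowed sum is a rolling sum over the reversed Series, because rolling windows look backwards while the list comprehension looks forwards. A sketch under that assumption; forward_window_sum and the sample frame are hypothetical names and data:

```python
import pandas as pd

def forward_window_sum(values, window=10):
    """Sum of values[i:i+window] for every position i, like the list comprehension."""
    # Reverse, take a backward-looking rolling sum, then restore the original order.
    # min_periods=1 lets the shorter windows at the end of the data still produce a sum.
    return values[::-1].rolling(window, min_periods=1).sum()[::-1]

# Hypothetical sample: a block, a gap of 12 zeros, then another block
sample = pd.DataFrame({"value": [0, 4, 99, 0, 5, 4] + [0] * 12 + [4, 1000]})
window_sum = forward_window_sum(sample["value"])
list_version = [sample["value"][i:i + 10].sum() for i in range(len(sample))]

start = (sample["value"] != 0).tolist().index(True)
valid = (window_sum >= 4).tolist()
# Stop at the first position whose next 10 values sum to less than 4
end = valid.index(False, start + 1)
result = sample.iloc[start:end]
```

The rolling sum is returned as floats, so compare it with a tolerance rather than with exact equality if the values are not whole numbers.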