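I would like to write a function that selects part of a DataFrame as follows: given a "first_non_zero_index" (in our example, a row index) and a value threshold (e.g. 4) as input, it returns an index, called "last_non_zero_index", such that df.loc[first_non_zero_index:last_non_zero_index] yields the output shown below. In addition, at most 10 consecutive zeros are allowed between two non-zero values.
I would really appreciate your help. Thanks in advance. Carlo
Input DataFrame: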
id, ts,value,
id1,2017-04-27 01:35:30,0
id1,2017-04-27 01:36:30,0
id1,2017-04-27 01:37:00,0
id1,2017-04-27 01:38:00,0
id1,2017-04-27 01:39:00,0
id1,2017-04-27 01:40:00,0
id1,2017-04-27 01:41:00,0
id1,2017-04-27 01:42:00,0
id1,2017-04-27 01:43:00,0
id1,2017-04-27 01:44:00,0
id1,2017-04-27 01:45:30,4.0
id1,2017-04-27 01:46:00,99.0
id1,2017-04-27 01:47:30,100.0
id1,2017-04-27 01:48:30,100.0
id1,2017-04-27 01:49:30,100.0
id1,2017-04-27 01:50:30,100.0
id1,2017-04-27 01:51:30,100.0
id1,2017-04-27 01:52:00,100.0
id1,2017-04-27 01:53:00,0
id1,2017-04-27 01:54:00,0
id1,2017-04-27 02:55:30,5.0
id1,2017-04-27 02:56:00,6.0
id1,2017-04-27 02:57:30,7.0
id1,2017-04-27 02:58:00,8.0
id1,2017-04-27 02:59:30,4.0
id1,2017-04-27 02:00:30,0
id1,2017-04-27 02:01:30,0
id1,2017-04-27 02:02:00,0
id1,2017-04-27 02:03:00,0
id1,2017-04-27 02:04:00,0
id1,2017-04-27 02:05:00,0
id1,2017-04-27 02:06:00,0
id1,2017-04-27 02:07:00,0
id1,2017-04-27 02:08:00,0
id1,2017-04-27 02:09:00,0
id1,2017-04-27 02:10:00,0
id1,2017-04-27 02:11:00,0
id1,2017-04-27 02:12:30,4.0
id1,2017-04-27 02:13:00,99.0
id1,2017-04-27 02:14:30,1000.0
id1,2017-04-27 02:15:30,1000.0
id1,2017-04-27 02:16:30,1000.0
id1,2017-04-27 02:17:30,1000.0
id1,2017-04-27 02:18:30,1000.0
id1,2017-04-27 01:19:00,1000.0
id1,2017-04-27 02:20:00,0
id1,2017-04-27 02:20:00,0
id1,2017-04-27 02:21:00,0
id1,2017-04-27 02:22:30,5.0
id1,2017-04-27 02:23:00,6.0
id1,2017-04-27 02:24:30,7.0
id1,2017-04-27 02:25:00,8.0
id1,2017-04-27 02:26:30,4.0
id1,2017-04-27 02:27:30,0
id1,2017-04-27 02:28:00,0
id1,2017-04-27 02:29:00,0
id1,2017-04-27 02:30:00,0
id1,2017-04-27 02:31:00,0
id1,2017-04-27 02:32:00,0
id1,2017-04-27 02:33:00,0
id1,2017-04-27 02:34:00,0
id1,2017-04-27 02:35:00,0
id1,2017-04-27 02:36:00,0
id1,2017-04-27 02:37:00,0
Output DataFrame:
id, ts,value,
id1,2017-04-27 01:45:30,4.0
id1,2017-04-27 01:46:00,99.0
id1,2017-04-27 01:47:30,100.0
id1,2017-04-27 01:48:30,100.0
id1,2017-04-27 01:49:30,100.0
id1,2017-04-27 01:50:30,100.0
id1,2017-04-27 01:51:30,100.0
id1,2017-04-27 01:52:00,100.0
id1,2017-04-27 01:53:00,0
id1,2017-04-27 01:54:00,0
id1,2017-04-27 02:55:30,5.0
id1,2017-04-27 02:56:00,6.0
id1,2017-04-27 02:57:30,7.0
id1,2017-04-27 02:58:00,8.0
id1,2017-04-27 02:59:30,4.0
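Answer 0 (score: 2)
This should put you on the right track. It takes the input DataFrame and returns an output DataFrame that runs from the first element that passes the threshold to the last element that passes it. The result is then cut at the first run of more than maxConsecutiveZeros zeros.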
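import pandas as pd

df = pd.read_csv('data.csv')

def extractPartialDataframe(df, threshold):
    # Index values of all rows whose value reaches the threshold
    indicesList = df[df.value >= threshold].index.tolist()
    # Keep everything from the first such row to the last one (inclusive)
    new_df = df.iloc[min(indicesList): max(indicesList) + 1]
    # reset_index() returns a new frame, so the slice is not modified in place
    return new_df.reset_index()

trimmedDF = extractPartialDataframe(df, 4)
maxConsecutiveZeros = 10
# Length of the run of identical values that each row belongs to
consecutives = trimmedDF["value"].groupby((trimmedDF["value"] != trimmedDF["value"].shift(1)).cumsum()).transform('count')
# Rows that are zero and belong to a run longer than the allowed maximum
tooManyConsecutiveZeros = trimmedDF[(trimmedDF["value"] == 0) & (consecutives > maxConsecutiveZeros)].index.tolist()
# Cut before the first over-long zero run; keep everything if there is none
final_df = trimmedDF.iloc[:tooManyConsecutiveZeros[0]] if tooManyConsecutiveZeros else trimmedDF
print(final_df)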
OUTPUT
10 id1 2017-04-27 01:45:30 4.0
11 id1 2017-04-27 01:46:00 99.0
12 id1 2017-04-27 01:47:30 100.0
13 id1 2017-04-27 01:48:30 100.0
14 id1 2017-04-27 01:49:30 100.0
15 id1 2017-04-27 01:50:30 100.0
16 id1 2017-04-27 01:51:30 100.0
17 id1 2017-04-27 01:52:00 100.0
18 id1 2017-04-27 01:53:00 0.0
19 id1 2017-04-27 01:54:00 0.0
20 id1 2017-04-27 02:55:30 5.0
21 id1 2017-04-27 02:56:00 6.0
22 id1 2017-04-27 02:57:30 7.0
23 id1 2017-04-27 02:58:00 8.0
24 id1 2017-04-27 02:59:30 4.0
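For the function signature the question asks for, a minimal sketch could look like the following. The function name, the `column` parameter and the small demo frame are made up for illustration. The threshold is deliberately left out, because the question does not say how it should interact with the zero-run rule, so this only implements the "at most 10 consecutive zeros" part.

```python
import pandas as pd

def last_index_before_long_zero_run(df, first_non_zero_index, max_zeros=10, column="value"):
    """Return the label of the last non-zero row reachable from
    first_non_zero_index without crossing more than max_zeros zeros in a row.
    (Hypothetical helper, not part of pandas.)"""
    last_non_zero = first_non_zero_index
    zeros_in_a_row = 0
    for label, value in df.loc[first_non_zero_index:, column].items():
        if value == 0:
            zeros_in_a_row += 1
            if zeros_in_a_row > max_zeros:
                break  # the gap is too long, stop scanning
        else:
            zeros_in_a_row = 0
            last_non_zero = label
    return last_non_zero

# Small made-up frame: 3 zeros, a block with a short gap, 12 zeros, another block
values = [0.0] * 3 + [4.0, 99.0, 0.0, 0.0, 5.0, 4.0] + [0.0] * 12 + [4.0, 99.0]
df = pd.DataFrame({"value": values})

last = last_index_before_long_zero_run(df, first_non_zero_index=3)
print(df.loc[3:last])
```

Because `df.loc` slices are inclusive on both ends, the returned label can be used directly as the upper bound.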
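Answer 1 (score: 0)
I think that by threshold you mean a delimiter (a threshold could imply either a >= or a <= operation, whereas I guess you need == for your desired output, since your example contains values both below and above 4).
Find the first non-zero value: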
start = (df['value'] != 0).tolist().index(True)
Find the delimiters (4.0 in your example):
fours = (df['value'] == 4).tolist()
Slice from the first non-zero value to the first delimiter that follows it:
df.iloc[start:fours.index(True, start+1)+1]
The output should match your example:
id ts value
10 id1 2017-04-27 01:45:30 4.0
11 id1 2017-04-27 01:46:00 99.0
12 id1 2017-04-27 01:47:30 100.0
13 id1 2017-04-27 01:48:30 100.0
14 id1 2017-04-27 01:49:30 100.0
15 id1 2017-04-27 01:50:30 100.0
16 id1 2017-04-27 01:51:30 100.0
17 id1 2017-04-27 01:52:00 100.0
18 id1 2017-04-27 01:53:00 0.0
19 id1 2017-04-27 01:54:00 0.0
20 id1 2017-04-27 02:55:30 5.0
21 id1 2017-04-27 02:56:00 6.0
22 id1 2017-04-27 02:57:30 7.0
23 id1 2017-04-27 02:58:00 8.0
24 id1 2017-04-27 02:59:30 4.0
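The three steps above can also be written with NumPy position arrays, which avoids the ValueError that `list.index` raises when no later delimiter exists. This is only a sketch on a small made-up frame; the fallback of keeping everything up to the end when no delimiter follows is my assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Made-up sample: two leading zeros, then values with 4.0 acting as delimiter
df = pd.DataFrame({"value": [0, 0, 4.0, 99.0, 0, 5.0, 4.0, 0, 4.0]})
values = df["value"].to_numpy()

# Position of the first non-zero value
start = np.flatnonzero(values != 0)[0]

# Positions of all delimiters, restricted to those after the start
delimiters = np.flatnonzero(values == 4)
later = delimiters[delimiters > start]

# Slice up to and including the first later delimiter, if there is one
result = df.iloc[start:later[0] + 1] if len(later) else df.iloc[start:]
print(result)
```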
[Update]
I don't know whether there is a pandas equivalent of this list comprehension, but maybe it can inspire you:
valid = [df['value'][i:i+10].sum() >= 4 for i in range(len(df))]
# The first False marks the row where the next 10 values no longer sum to at least 4
df.iloc[start:valid.index(False, start+1)]
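The above is not exactly what you asked for: it stops when the next 10 consecutive values sum to less than 4. Strictly speaking, what you asked for is more like this: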
valid = [bool(df['value'][i:i+10].sum()) or value >= 4
for i, value in enumerate(df['value'])]
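If performance is not too bad, using this is probably better than banging your head against the wall trying to find a "pure pandas" approach.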
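For what it's worth, one possible pandas counterpart of the first list comprehension is a rolling sum over the reversed series: reversing turns the forward-looking window into an ordinary trailing one. The sketch below uses a small made-up frame and checks that both versions agree; the cut-off is taken at the first False after the start.

```python
import pandas as pd

# Made-up sample: a block with a short gap, 12 zeros, then another block
values = [0, 0, 4.0, 99.0, 0, 0, 5.0, 4.0] + [0] * 12 + [4.0, 99.0]
df = pd.DataFrame({"value": values})
window = 10

# Loop version: sum of each row together with the 9 rows after it
valid_loop = [bool(df["value"].iloc[i:i + window].sum() >= 4) for i in range(len(df))]

# Rolling version: reverse, take a trailing window, reverse back.
# min_periods=1 lets the shorter windows at the end of the frame count too.
reversed_sums = df["value"].iloc[::-1].rolling(window, min_periods=1).sum()
valid_rolling = (reversed_sums.iloc[::-1] >= 4).tolist()

start = 2  # position of the first non-zero value in this sample
stop = valid_rolling.index(False, start + 1)
print(df.iloc[start:stop])
```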