Question

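I have a dataframe in pandas (Python) that holds a measured variable from an experiment, with a time index. I want to extract the times at which this value drops below a certain value. However, noise sometimes makes the variable bounce above and below the threshold, so I only want to record a new time point if the variable has also risen above a second, higher threshold in between. The code I have written so far is: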

def findPriming(df, col, sphigh, splow):
    # Start the counter and the "currently priming" flag.
    i = 1  # This skips the first value but lets us compare with the previous one without errors.
    currentlyPriming = False
    primeTimes = []
    # Iterate through the series:
    while i < len(df):
        # If the value is above sphigh (e.g. 20), everything is fine and it's not priming.
        if df[col].iloc[i] > sphigh:
            currentlyPriming = False

        # If it's below splow (e.g. 16):
        elif df[col].iloc[i] < splow:
            # Check whether we are already priming:
            if not currentlyPriming:
                # We are now priming and weren't before, so log it.
                primeTimes.append(df.index[i])
            # Now that we are priming, set the flag.
            currentlyPriming = True
        # Increment the counter.
        i += 1

    return primeTimes

But I imagine this is very inefficient (in fact it takes forever to run, which tells me the same thing).

I tried to think of a way to get rid of the two if checks on every data point, but couldn't get it to work.

Does anyone have any ideas for improving this? I tried searching for similar code but couldn't find anything.

Edited to include a sample of my dataframe:

DateTime                      Data
2013-08-08 15:46:41           25.203461
2013-08-08 15:46:51           23.241514
2013-08-08 15:47:01           22.256216
2013-08-08 15:47:11           21.256216
2013-08-08 15:47:21           16.261763
2013-08-08 15:47:31           13.249237
2013-08-08 15:47:41           17.249237
2013-08-08 15:47:51           18.238962
2013-08-08 15:48:01           13.207640
2013-08-08 15:48:11           20.207640

A link to an example chart that I (badly) drew [inlined --ed]

example image

Answer 1

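If I understand your situation correctly, you want to find the times at which the value falls below 16, but only the first such fall within each below-20 period. I can think of a few ways to do this. Some are shorter than the one below, but this trick is useful and applies to many problems, so it's worth knowing.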

groupby + cumsum.

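The basic idea is to use groupby to group the times into clusters of contiguous times below the upper line. Unfortunately for our purposes, groupby combines non-contiguous groups that share the same key, but we can use cumsum to work around that. (Maybe groupby will someday grow a contiguous=True/False flag, defaulting to False, to make this easier..)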
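To illustrate the contiguity point, here is a minimal sketch on a made-up boolean Series (the names `flags`, `run_start` and `run_id` are only for illustration). Comparing each value with its shifted predecessor marks where a run starts, and the cumulative sum of those marks gives every contiguous run its own label:

```python
import pandas as pd

# Toy boolean series: True, True, then False, False, then True, then False.
flags = pd.Series([True, True, False, False, True, False])

# A run starts wherever a value differs from the one before it
# (the first row compares against NaN, so it always counts as a start).
run_start = flags != flags.shift()

# Counting the starts so far labels each contiguous run: 1, 1, 2, 2, 3, 4.
run_id = run_start.cumsum()

# Grouping by the raw flags merges non-contiguous runs into 2 groups,
# while grouping by run_id keeps the 4 contiguous runs apart.
n_groups_by_value = flags.groupby(flags).ngroups
n_groups_by_run = flags.groupby(run_id).ngroups
```

Grouping by `run_id` instead of the raw condition is what keeps separate below-20 periods from being lumped together.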

If you have the times as the index, then

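df = df.reset_index()  # move the DateTime index into a regular column
upper_limit = 20
lower_limit = 16
above_upper_line = df.Data > upper_limit
# True wherever the series switches sides of the upper line
upper_line_crossed = above_upper_line != above_upper_line.shift()
# running count of crossings: one label per contiguous run
clusters = upper_line_crossed.cumsum()
below_lower_line = df.Data < lower_limit

# first time below the lower line within each run
times = df[below_lower_line].groupby(clusters)["DateTime"].first().values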

which produces

>>> times
array(['2013-08-08T11:47:31.000000000-0400'], dtype='datetime64[ns]')

[When I get a chance, I'll try to write up an explanation.]
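For anyone who wants to step through it, here is a self-contained sketch of the same groupby + cumsum idea, run on the sample data from the question. The function name `find_drop_times` and its parameter names are my own, not part of pandas, and it assumes the Series is indexed by time:

```python
import pandas as pd

# Sample data copied from the question (one reading every 10 seconds).
index = pd.date_range("2013-08-08 15:46:41", periods=10,
                      freq=pd.Timedelta(seconds=10), name="DateTime")
data = [25.203461, 23.241514, 22.256216, 21.256216, 16.261763,
        13.249237, 17.249237, 18.238962, 13.207640, 20.207640]
df = pd.DataFrame({"Data": data}, index=index)


def find_drop_times(series, upper, lower):
    """Return the first time `series` is below `lower` in each run
    between crossings of `upper` (hypothetical helper)."""
    above_upper = series > upper
    # New cluster label every time the series switches sides of `upper`.
    clusters = (above_upper != above_upper.shift()).cumsum()
    below_lower = series < lower
    # Timestamps of all below-lower samples, indexed by themselves.
    hits = series.index.to_series()[below_lower]
    # Keep only the first hit per cluster.
    return hits.groupby(clusters[below_lower]).first().tolist()


# For the sample data the cluster labels are 1,1,1,1,2,2,2,2,2,3,
# so the two dips below 16 (15:47:31 and 15:48:01) share cluster 2.
times = find_drop_times(df["Data"], upper=20, lower=16)
```

Here `times` holds a single timestamp, 15:47:31, because the series never climbs back above 20 before the second dip at 15:48:01.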

Answer 2

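Edit: going by the chart you've now included, the solution below is too simplistic. I'll leave it here, since I think it would be one component of a more complete approach.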

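You don't need any loops to do this. You can use boolean (logical) indexing. Your example isn't runnable (we don't have any of your data), so here is a toy example: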

In [1]: import numpy as np

In [2]: import pandas

In [3]: dateindex = pandas.DatetimeIndex(freq='10T', start='2013-11-11 06:30', end='2013-11-11 12:30')

In [4]: df = pandas.DataFrame(np.random.normal(size=(len(dateindex),3)), columns=list('ABC'), index=dateindex)

In [5]: df.head()
Out[5]:
                            A         B         C
2013-11-11 06:30:00  0.958990  0.234201  0.216744
2013-11-11 06:40:00 -2.173221  0.232468  0.696578
2013-11-11 06:50:00 -0.089300  2.081265 -0.482739
2013-11-11 07:00:00 -0.621272  0.226189  1.025683
2013-11-11 07:10:00  1.091428 -0.097205 -0.570189

In [6]: df[df['A'] < -1.0].index.tolist()
Out[6]:
[Timestamp('2013-11-11 06:40:00', tz=None),
 Timestamp('2013-11-11 09:20:00', tz=None),
 Timestamp('2013-11-11 09:30:00', tz=None),
 Timestamp('2013-11-11 10:40:00', tz=None),
 Timestamp('2013-11-11 11:00:00', tz=None),
 Timestamp('2013-11-11 12:20:00', tz=None)]

In this case, I'm just using random data, with -1.0 standing in for splow from your example. Likewise, 'A' corresponds to col in your function.
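As far as I know, recent pandas versions no longer accept start/end in the DatetimeIndex constructor, so the session above may not run as written today. A rough equivalent, assuming `pd.date_range` for the index and a seeded NumPy generator (my choice, so the random numbers differ from the output shown above), might look like this:

```python
import numpy as np
import pandas as pd

# Same 10-minute grid as in the session above, built with date_range.
dateindex = pd.date_range(start="2013-11-11 06:30",
                          end="2013-11-11 12:30", freq="10min")

# Seeded generator so the toy data is reproducible (an assumption of this sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(len(dateindex), 3)),
                  columns=list("ABC"), index=dateindex)

# Boolean mask: True on every row where column A is below the threshold.
mask = df["A"] < -1.0

# Index labels (timestamps) of the rows selected by the mask.
times_below = df.loc[mask].index.tolist()
```

Note that this returns every timestamp below the threshold, not just the first one of each dip, which is why it is only a building block for the problem in the question.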

How to find the times at which a variable drops below a certain value in pandas
