Question

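I have a pandas DataFrame object that contains NaNs. For each column, I want to find all blocks of consecutive valid rows and get the first and last index of each block.

Sample data: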

[
  [ 1,nan],
  [ 2,nan],
  [ 3,nan],
  [ 4,3.0],
  [ 5,1.0],
  [ 6,4.0],
  [ 7,1.0],
  [ 8,5.0],
  [ 9,9.0],
  [10,2.0],
  [11,nan],
  [12,nan],
  [13,6.0],
  [14,5.0],
  [15,3.0],
  [16,5.0]
]

where the first column is the index and the second column holds the values I want to filter. The result should be

[(4,10), (13,16)]

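For performance reasons, I would like to avoid iterating over the data manually with a for loop...

Update 1:

Two additional criteria:

The valid values in the value column do not have to be equal. They can be anywhere between -inf and +inf.
I only need the first and last index of each valid block, not of the NaN blocks in between.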

Answer 1

I think you can use:

#set column names and set index by first column 
df.columns = ['idx', 'a']
df = df.set_index('idx')
#find groups
df['b'] = (df.a.isnull() != df.a.shift(1).isnull()).cumsum()
#remove NaN
df = df[df.a.notnull()].reset_index()
#aggregate first and last values of column idx  
df = df['idx'].groupby(df.b).agg(['first', 'last'])
print(list(zip(df['first'], df['last'])))
[(4, 10), (13, 16)]
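
For readers on Python 3 and a recent pandas, the grouping idea above can be sketched as a self-contained script. The helper name `valid_blocks` and the way the sample frame is built are assumptions made for illustration; the data is the sample from the question.

```python
import numpy as np
import pandas as pd

nan = np.nan
values = [nan, nan, nan, 3.0, 1.0, 4.0, 1.0, 5.0,
          9.0, 2.0, nan, nan, 6.0, 5.0, 3.0, 5.0]
# Index labels 1..16, matching the first column of the sample data.
df = pd.DataFrame({"a": values}, index=pd.RangeIndex(1, 17, name="idx"))


def valid_blocks(series):
    """Return (first, last) index labels for each run of non-NaN values."""
    is_null = series.isnull()
    # The counter increases wherever the null/valid state flips,
    # so every run of equal state shares one group id.
    group_id = (is_null != series.shift(1).isnull()).cumsum()
    # Keep only the valid rows and aggregate their index labels per group.
    labels = series.index.to_series()[~is_null]
    bounds = labels.groupby(group_id[~is_null]).agg(["first", "last"])
    return list(zip(bounds["first"].tolist(), bounds["last"].tolist()))


blocks = valid_blocks(df["a"])
print(blocks)  # [(4, 10), (13, 16)]
```

Because the function takes a single Series, it could be applied to each column in turn, which matches the "for each column" wording of the question.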

I then tried modifying cggarvey's solution:

#set column names and set index by first column 
df.columns = ['idx', 'a']
df = df.set_index('idx')
#find edges 
pre =  df['a'] - df['a'].diff(-1)
pst = df['a'] - df['a'].diff(1)
a = pre.notnull() & pst.isnull()
z = pre.isnull() & pst.notnull()
print(list(zip(a[a].index, z[z].index)))
[(4, 10), (13, 16)]
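
One caveat with the difference-based edge detection above: both `pre` and `pst` need a valid neighbour, so a block made of a single valid row is not reported. A minimal sketch that compares the validity mask with shifted copies of itself instead, assuming a pandas version where `Series.shift` accepts `fill_value` (0.24 or later); the function name `block_edges` is hypothetical:

```python
import numpy as np
import pandas as pd


def block_edges(series):
    """Return (first, last) index labels of each run of non-NaN values."""
    valid = series.notnull()
    # A block starts where a valid row follows an invalid row (or the start).
    starts = valid & ~valid.shift(1, fill_value=False)
    # A block ends where a valid row precedes an invalid row (or the end).
    ends = valid & ~valid.shift(-1, fill_value=False)
    first = series.index[starts.to_numpy()].tolist()
    last = series.index[ends.to_numpy()].tolist()
    return list(zip(first, last))


# Small illustrative series with a single-row block at label 2.
s = pd.Series([np.nan, 7.0, np.nan, 3.0, 1.0, np.nan],
              index=[1, 2, 3, 4, 5, 6])
print(block_edges(s))  # [(2, 2), (4, 5)]
```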

Answer 2

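Here is an example using NumPy. I am not sure how it compares with @jezrael's solution, but you mentioned that performance is a requirement, so you can compare the two.

Note: this assumes your columns are named "index" and "val".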

import numpy as np

pre = np.array(df['val'] - df.diff(-1)['val'])
pst = np.array(df['val'] - df.diff(1)['val'])

a = np.where(~np.isnan(pre) & np.isnan(pst))
z = np.where(np.isnan(pre) & ~np.isnan(pst))
# .ix was removed in pandas 1.0; .iloc selects by position instead
output = list(zip(df['index'].iloc[a[0]], df['index'].iloc[z[0]]))

Output:

[(4, 10), (13, 16)]
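
For a NumPy-only variant, one option is to pad the validity mask with `False` on both sides and look at where it changes. This is a sketch of a different technique from the one above, not the answerer's code; the function name `valid_runs` is hypothetical, and the inputs are assumed to be a label array and a float array of equal length.

```python
import numpy as np


def valid_runs(index, values):
    """Return (first, last) index labels of each run of non-NaN values."""
    index = np.asarray(index)
    valid = ~np.isnan(np.asarray(values, dtype=float))
    # Pad with False so runs touching either end still produce an edge.
    padded = np.concatenate(([False], valid, [False])).astype(np.int8)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)      # first position of each run
    ends = np.flatnonzero(change == -1) - 1   # last position of each run
    return list(zip(index[starts].tolist(), index[ends].tolist()))


nan = np.nan
index = np.arange(1, 17)
values = [nan, nan, nan, 3.0, 1.0, 4.0, 1.0, 5.0,
          9.0, 2.0, nan, nan, 6.0, 5.0, 3.0, 5.0]
print(valid_runs(index, values))  # [(4, 10), (13, 16)]
```

Unlike the difference-based version, this also reports runs that consist of a single valid row.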

Pandas: get the first and last index of consecutive valid rows in a column
