Pandas way of finding discontinuous data

Date: 2017-08-16 09:31:08

Tags: python pandas numpy

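I want to find out which columns of a pandas DataFrame contain discontinuous data. By "discontinuous" I mean that the values drop from some nonzero value back to zero and then become nonzero again.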

[0,0,0,1,2,3,4,5,0,0,0] # continuous
[0,0,0,1,2,0,4,5,0,0,0] # not continuous

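I have managed to write some code that does this, using a for loop to go through each column of the DataFrame. Here is a working snippet that illustrates what I mean: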

import numpy as np
import pandas as pd

def find_discontinuous(series):
    switch = 0
    for index, val in series.items():
        # print(val, end=" ")
        if switch==0 and val==0:
            # print("still zero")
            continue
        elif switch==0 and val!=0:
            switch = 1
        if switch==1 and val==0:
            # print("back to zero")
            switch = 2
            continue
        if switch==2 and val!=0:
            # print("supposed to be zero")
            return "not continuous"
    return "continuous"

data = np.array([[0,1,2,3,4,5,0],
                 [0,1,2,0,4,5,0]])
df = pd.DataFrame(data,columns=list(range(7)),index=list(range(2))).transpose()

for column in df.columns:
    series = df.loc[:,column]
    res = find_discontinuous(series)
    print(column,res)

Output:

0 continuous
1 not continuous

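I read somewhere that iterating over a pandas DataFrame with for loops is probably not the right approach because it is slow. What would be the pandas way to achieve the same thing?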

2 Answers:

Answer 0 (score: 1)

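You can apply a function over the df to get a Series with the column names as the index and a Boolean indicating whether each column is continuous. Note that this one-liner only flags a single zero sitting between two positive values, so a gap of two or more zeros is not detected: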

df.apply(lambda y: not(any(map(lambda x: x[1] == 0 and x[0]>0 and x[2]>0, zip(reversed(y), reversed(y[:-1]), reversed(y[:-2]))))))

Alternatively, you can use your own function with apply:

df.apply(find_discontinuous)
#0        continuous
#1    not continuous
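
If a Boolean result is preferred and zero gaps of any length should be caught, a function along these lines could be passed to apply. This is only a sketch: has_single_nonzero_block is a hypothetical name, not part of pandas, the third column is made-up illustration data, and treating an all-zero column as continuous is an assumption carried over from the loop in the question.

```python
import numpy as np
import pandas as pd

def has_single_nonzero_block(series):
    """Return True if the nonzero values of `series` form one unbroken run."""
    # Positions (not labels) of the nonzero entries
    positions = np.flatnonzero(series.to_numpy() != 0)
    if positions.size == 0:
        # Assumption: an all-zero column counts as continuous,
        # as it does in the loop-based find_discontinuous
        return True
    # One unbroken run means the span from the first to the last nonzero
    # position holds exactly as many entries as there are nonzero values
    return bool(positions[-1] - positions[0] + 1 == positions.size)

# Columns 0 and 1 are the question's data; column 2 is an invented
# example with a gap of two zeros
df = pd.DataFrame({0: [0, 1, 2, 3, 4, 5, 0],
                   1: [0, 1, 2, 0, 4, 5, 0],
                   2: [0, 1, 0, 0, 4, 5, 0]})

result = df.apply(has_single_nonzero_block)
print(result)
```

The function still runs once per column through apply, but the work inside each call is done by NumPy rather than by a Python loop over the values.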

Answer 1 (score: 1)

You only need to check that there are no zeros between the first change away from zero and the last change back to zero:

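def is_continuous(series):
    positive = series > 0
    # idxmax on a Boolean Series returns the label of the first True
    id_first_true = positive.idxmax()
    # reversing first makes idxmax return the label of the last True
    id_last_true = positive[::-1].idxmax()
    # label-based slicing with .loc includes both endpoints
    return bool(positive.loc[id_first_true:id_last_true].all())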
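
The same idea can be sketched for the whole DataFrame at once, without apply, by counting how many runs of nonzero values each column contains. This is an illustrative variant rather than the answerer's code: the names nonzero, run_starts and is_continuous_by_column are hypothetical, and it assumes "nonzero" (not just "positive") is what should count as data.

```python
import numpy as np
import pandas as pd

# Same data as in the question: one column per original row
data = np.array([[0, 1, 2, 3, 4, 5, 0],
                 [0, 1, 2, 0, 4, 5, 0]])
df = pd.DataFrame(data).transpose()

# True wherever a value is nonzero
nonzero = df.ne(0)
# A run of nonzero values starts where a cell is nonzero and the cell
# above it is zero (the first row is compared against a filled-in False)
run_starts = nonzero & ~nonzero.shift(fill_value=False)
# A column is continuous when it contains at most one such run
is_continuous_by_column = run_starts.sum() <= 1
print(is_continuous_by_column)
```

Under this definition an all-zero column has zero runs and is reported as continuous, which matches the behaviour of the loop in the question.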