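I want to find out which columns in a pandas DataFrame contain discontinuous data. By "discontinuous" I mean that the values drop from nonzero back to zero and then become nonzero again.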
[0,0,0,1,2,3,4,5,0,0,0] # continuous
[0,0,0,1,2,0,4,5,0,0,0] # not continuous
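I have managed to write some code that does this, using a for loop to iterate over each column of the DataFrame. I made a working snippet below to illustrate what I mean: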
import numpy as np
import pandas as pd

def find_discontinuous(series):
    # switch: 0 = leading zeros, 1 = inside the nonzero run, 2 = back at zero
    switch = 0
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for index, val in series.items():
        # print(val, end=" ")
        if switch == 0 and val == 0:
            # print("still zero")
            continue
        elif switch == 0 and val != 0:
            switch = 1
        if switch == 1 and val == 0:
            # print("back to zero")
            switch = 2
            continue
        if switch == 2 and val != 0:
            # print("supposed to be zero")
            return "not continuous"
    return "continuous"

data = np.array([[0, 1, 2, 3, 4, 5, 0],
                 [0, 1, 2, 0, 4, 5, 0]])
df = pd.DataFrame(data, columns=list(range(7)), index=list(range(2))).transpose()

for column in df.columns:
    series = df.loc[:, column]
    res = find_discontinuous(series)
    print(column, res)
Output:
0 continuous
1 not continuous
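I read somewhere that iterating over a pandas DataFrame with a for loop may not be the right approach because it is slow. What would be the pandas way of achieving the same thing?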
Answer 0 (score: 1)
You can use apply on df to get a Series with the column names as the index and a Boolean that says whether each column is continuous:
df.apply(lambda y: not(any(map(lambda x: x[1] == 0 and x[0]>0 and x[2]>0, zip(reversed(y), reversed(y[:-1]), reversed(y[:-2]))))))
Alternatively, you can pass your own function to apply:
df.apply(find_discontinuous)
#0 continuous
#1 not continuous
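The one-liner above looks at every run of three consecutive values and flags a zero that has a positive value directly before and after it. Here is a rough sketch of the same check written with shift, which may be easier to read. The helper name has_isolated_zero and the third test column are my own additions for illustration. Note that it returns True when a gap is found, the opposite sense of the one-liner, and that this style of check does not catch a gap of two or more zeros in a row:

```python
import numpy as np
import pandas as pd

def has_isolated_zero(col):
    """True if some zero has a positive value directly before and after it."""
    # shift(1) is the previous row, shift(-1) the next row; the NaN values
    # introduced at the edges compare as False, so the edges are never flagged.
    gap = col.eq(0) & col.shift(1).gt(0) & col.shift(-1).gt(0)
    return bool(gap.any())

data = np.array([[0, 1, 2, 3, 4, 5, 0],   # no gap
                 [0, 1, 2, 0, 4, 5, 0],   # single-zero gap
                 [0, 1, 0, 0, 4, 5, 0]])  # two-zero gap (hypothetical extra case)
df = pd.DataFrame(data).transpose()

flags = df.apply(has_isolated_zero)
print(flags)
```

Only column 1 is flagged here. Column 2 is also discontinuous by the question's definition, so if longer gaps can occur in your data, the loop-based function is the safer choice of the two.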
Answer 1 (score: 1)
You only need to check that there are no zeros between the first and the last positive value:
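def is_continuous(series):
    # label of the first positive value
    id_first_true = (series > 0).idxmax()
    # label of the last positive value
    id_last_true = (series > 0)[::-1].idxmax()
    # every value between those two labels (inclusive) must be positive
    return all((series > 0).loc[id_first_true:id_last_true] == True)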
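If you want to avoid calling a Python function once per column, the same idea can be applied to the whole frame at once by counting how many separate nonzero runs each column has. This is only a sketch under my own assumptions: continuous_columns is a made-up name, it tests != 0 as in the question rather than > 0, and it treats an all-zero column as continuous, which matches the loop in the question. As far as I can tell, is_continuous above returns False for an all-zero column.

```python
import numpy as np
import pandas as pd

def continuous_columns(frame):
    """Boolean Series: True where a column has at most one nonzero run."""
    nonzero = (frame.to_numpy() != 0).astype(int)
    # Prepend a row of zeros so a run that starts in the first row is counted.
    padded = np.vstack([np.zeros((1, nonzero.shape[1]), dtype=int), nonzero])
    # A 0 -> 1 step between consecutive rows marks the start of a nonzero run.
    run_starts = (np.diff(padded, axis=0) == 1).sum(axis=0)
    return pd.Series(run_starts <= 1, index=frame.columns)

data = np.array([[0, 1, 2, 3, 4, 5, 0],   # one run
                 [0, 1, 2, 0, 4, 5, 0],   # two runs, single-zero gap
                 [0, 1, 0, 0, 4, 5, 0],   # two runs, two-zero gap
                 [0, 0, 0, 0, 0, 0, 0]])  # no run at all
df = pd.DataFrame(data).transpose()

result = continuous_columns(df)
print(result)
```

Columns 0 and 3 come out as True and columns 1 and 2 as False. Because it counts runs instead of looking at neighbours, the length of the gap does not matter.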