Using np.argwhere to count missing values in a pandas DataFrame

Date: 2016-12-20 12:09:30

Tags: python pandas dataframe np

I have a DataFrame like this:

  RTD   I
0  BA  32
1  BA  15
2  BA  22
3  BA  75
4  BA  28
5  BA  32
6  BA   7

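Now I want to compute the minimum and the maximum number of consecutive rows in which the value 32 does not appear.

The code is (credit: @MaxU):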

len(x) - np.argwhere(x.I == 32).max() - 1
Out = 1 (this is correct)

len(x) - np.argwhere(x.I == 32).min() - 1
Out = 6 (this is wrong; the result should be 4)
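
To see where the 6 comes from, it may help to look at the row positions of 32 directly. The sketch below is only an illustration on the sample data; the variable names (`pos`, `rows_after_last`, `rows_after_first`, `gaps_between`) are made up here, and it assumes the default 0..n-1 row order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'RTD': ['BA'] * 7, 'I': [32, 15, 22, 75, 28, 32, 7]})

# Row positions where I == 32, taken from a plain NumPy boolean array.
pos = np.flatnonzero((df['I'] == 32).to_numpy())  # positions 0 and 5

# Rows after the LAST 32: only row 6.
rows_after_last = len(df) - pos.max() - 1

# Rows after the FIRST 32: rows 1..6, which also counts the second 32.
rows_after_first = len(df) - pos.min() - 1

# Rows strictly between consecutive occurrences of 32.
gaps_between = np.diff(pos) - 1

print(rows_after_last, rows_after_first, gaps_between)
```

So the `.min()` expression measures everything after the first 32, not the longest run without 32; the expected 4 is the gap between the two occurrences.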

The solution I found is:

import pandas as pd
import numpy as np


df = pd.DataFrame({'RTD': ['BA']*7, 'I': [32, 15, 22, 75, 28, 32, 7]})
print(df)

To compute the maximum and minimum delay:

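def rolling_count(val):
    # Count how many times in a row the same value has been seen.
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0  # static variable (kept as a function attribute)
rolling_count.previous = None  # static variable (kept as a function attribute)


# True where I == 32; runs of False are the rows without 32.
df['count'] = df['I'] == 32
ddf = df['count'].apply(rolling_count)
print('delay maximum', max(ddf))

# .values passes a plain array; np.argwhere on a Series can fail in some pandas/NumPy versions.
DelayMinimum = len(df) - np.argwhere((df.I == 32).values).max() - 1
print(DelayMinimum)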

2 Answers:

Answer 0: (score: 0)

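If the index is numbered 0 to n-1, you can select only the rows where the value is 32 and then take the first difference of their index.

np.diff(np.append(-2, df.query('I==32').index.values)) - 1

I'm not sure about the first value, but this should get you very close.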
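
On the sample data the expression above evaluates to `[1, 4]`. The leading 1 comes from the `-2` sentinel rather than from the data, which is probably why the first value looks odd. One possible variant, sketched here under the same assumption of a 0..n-1 index, pads the positions with `-1` and `len(df)` so that the runs before the first 32 and after the last 32 are measured as well; the names `idx`, `padded`, and `runs` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'RTD': ['BA'] * 7, 'I': [32, 15, 22, 75, 28, 32, 7]})

# The expression from the answer, unchanged.
original = np.diff(np.append(-2, df.query('I==32').index.values)) - 1

# Variant: pad with -1 (just before the first row) and len(df) (just past the last row).
idx = df.index[df['I'] == 32].to_numpy()
padded = np.concatenate(([-1], idx, [len(df)]))

# Lengths of the runs without 32: before the first 32, between occurrences, after the last 32.
runs = np.diff(padded) - 1

print(original, runs)
```

With this padding the run lengths are 0, 4, and 1, so the longest run without 32 is 4 and the trailing run is 1.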

Answer 1: (score: 0)

It's a somewhat brute-force solution, but it works. I've included the entire code so that you can correct me if I've misunderstood something:

import pandas as pd
import numpy as np

df = pd.DataFrame({'RTD': ['BA']*7, 'I': [32, 15, 22, 75, 28, 32, 7]})
# Index labels of the rows where I == 32
occurrences = df.index[df['I'] == 32].values

# Longest run of non-32 rows between two consecutive occurrences
max_diff = 0
for i in range(len(occurrences)-1):
    curr_diff = occurrences[i + 1] - occurrences[i] - 1
    if curr_diff > max_diff:
        max_diff = curr_diff

# Shortest distance between consecutive occurrences, with the last row as a final boundary
min_diff = len(df['I'])
occurrences = np.append(occurrences, min_diff - 1)

for i in range(len(occurrences)-1):
    curr_diff = occurrences[i + 1] - occurrences[i]
    if curr_diff < min_diff:
        min_diff = curr_diff
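
As a possible alternative to the explicit loops, the same run lengths can be sketched with a cumulative sum and a groupby. This is not part of the answer above, just one way to express the idea; `is_32`, `block`, and `run_lengths` are names chosen for the example, and the approach does not rely on the index being 0..n-1.

```python
import pandas as pd

df = pd.DataFrame({'RTD': ['BA'] * 7, 'I': [32, 15, 22, 75, 28, 32, 7]})

is_32 = df['I'] == 32

# Each occurrence of 32 starts a new block; cumsum gives every row its block label.
block = is_32.cumsum()

# Count the non-32 rows inside each block.
run_lengths = (~is_32).groupby(block).sum()

print(run_lengths.max(), run_lengths.min())
```

On the sample data this gives run lengths of 4 and 1, matching `max_diff` and `min_diff` from the loops. If the data started with rows other than 32, they would land in block 0 and be counted as a leading run.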