Get the indices of rows with consecutive 0 values from a large Pandas DataFrame

Asked: 2019-07-09 13:52:18

Tags: python pandas

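I have a dataset containing half-hourly energy consumption. I am trying to get a list of the indices of rows where no energy was consumed for a long period. In other words, I am trying to get lists of indices for runs of consecutive 0 values in a particular column. I use the code below, which seems to work for a while, but then it starts adding lists of indices whose values are not 0.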

import more_itertools as mit

indices = df.loc[df[df.columns[2]] == df[df.columns[2]].isnull()].index.values.tolist()
outages_indices = [list(group) for group in mit.consecutive_groups(indices)]
long_outages_indices = []
for i in outages_indices:
    if len(i) >= 8:
        long_outages_indices.append(i)

For example, at row 849246 the value is indeed 0, but at row 1543677 the value is 0.105 and it is still part of the list.

The first few rows of the DataFrame:

LCLid            tstp                           energy(kWh/hh)
MAC000002        2012-10-12 00:30:00.0000000    0.0
MAC000002        2012-10-12 01:00:00.0000000    0.0
MAC000002        2012-10-12 01:30:00.0000000    0.0
MAC000002        2012-10-12 02:00:00.0000000    0.0
MAC000002        2012-10-12 02:30:00.0000000    0.0

Desired output (this is what I already get, but its contents are incorrect):

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...],
 [861958, 861959, 861960, 861961 ...],
 [862015, 862016, 862017, 862018, ...], ...]

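Edit: Solved. When I concatenated multiple CSV files into one Pandas DataFrame, the index numbering restarted from 0 for each newly concatenated file. I reset the index, and that solved my problem.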
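A minimal sketch of the situation described in the edit, assuming two small made-up frames (`part_a`, `part_b`) in place of the real CSV files. It shows how a plain `pd.concat` keeps repeated labels, why a label lookup can then return a non-zero row, and how `ignore_index=True` avoids it:

```python
import pandas as pd

# Two small frames standing in for separately loaded CSV files (hypothetical data).
part_a = pd.DataFrame({"energy": [0.0, 0.0, 0.3]})
part_b = pd.DataFrame({"energy": [0.105, 0.0, 0.0]})

# Plain concatenation keeps each frame's own 0..n-1 labels, so labels repeat.
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())              # [0, 1, 2, 0, 1, 2]
print(stacked.index.is_unique)             # False

# Label 0 now refers to two rows, one of which is not zero.
print(stacked.loc[0, "energy"].tolist())   # [0.0, 0.105]

# ignore_index=True assigns one fresh label per row
# (calling .reset_index(drop=True) afterwards has the same effect).
combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined.index.tolist())             # [0, 1, 2, 3, 4, 5]
```

Checking `df.index.is_unique` before collecting index labels is a cheap way to catch this early.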

2 answers:

Answer 0: (score: 0)

You want to use cumsum together with groupby:

import pandas as pd

df = pd.DataFrame({'energy':[1,0,0,0,1,1,0,0,0]})

# mark the non-zero
s = df.energy.ne(0)

# groupby
new_df = df.groupby([s, s.cumsum()]).apply(lambda x: list(x.index))

which gives you

energy  energy
False   1         [1, 2, 3]
        3         [6, 7, 8]
True    1               [0]
        2               [4]
        3               [5]
dtype: object

The indices of interest are the ones listed under False in index level 0, that is:

new_df.loc[False]

which gives you

energy
1    [1, 2, 3]
3    [6, 7, 8]
dtype: object
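The question also asks for only the long runs. A minimal sketch of one way to add that filter on top of the same cumsum idea, assuming a unique index and a made-up threshold `min_length` (the question uses 8; 4 is used here to fit the toy data):

```python
import pandas as pd

df = pd.DataFrame({"energy": [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]})
min_length = 4  # hypothetical threshold for a "long" run

# True where the value is non-zero; the running sum stays constant
# across each run of zeros, so it can serve as a run identifier.
nonzero = df["energy"].ne(0)
run_id = nonzero.cumsum()

# Keep only the labels of zero rows and collect them per run.
zero_labels = df.index[~nonzero].to_series()
runs = [labels.tolist() for _, labels in zero_labels.groupby(run_id[~nonzero])]

# Drop the runs that are too short.
long_runs = [run for run in runs if len(run) >= min_length]
print(long_runs)  # [[6, 7, 8, 9]]
```

The grouping aligns on index labels, so this sketch relies on the index being unique.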

Answer 1: (score: 0)

Your solution is close, but I think there is a mistake in the condition used to extract the zero-energy indices. You have:

. . .
indices = df.loc[df[df.columns[2]] == df[df.columns[2]].isnull()].index.values.tolist()
. . .

This is a strange way to look for the indices of zero-energy rows.
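To see what that condition actually selects, here is a small sketch on a made-up Series (the values are illustrative, not taken from the dataset). As far as I can tell, comparing the values with the boolean `isnull()` mask ends up matching the zeros, because `False` compares equal to 0 and NaN never compares equal to `True`; a direct comparison with 0 says the same thing more clearly:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two zeros, one non-zero value, one missing value.
energy = pd.Series([0.0, 0.105, np.nan, 0.0])

# The condition from the question: values compared with a boolean "is null" mask.
odd_mask = energy == energy.isnull()

# The direct condition.
plain_mask = energy == 0

print(odd_mask.tolist())    # [True, False, False, True]
print(plain_mask.tolist())  # [True, False, False, True]
```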

The following works for me:

import pandas as pd
import more_itertools as mit

df = pd.DataFrame({'energy': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# find the indices with zero energy
indices = df.loc[df['energy'] == 0].index.values.tolist()

# extract long outages
threshold = 4  # minimum length for an outage to be considered "long"
outages_indices = [list(group) for group in mit.consecutive_groups(indices)]
long_outages_indices = [l for l in outages_indices if len(l) >= threshold]

If you also want to count missing (None) energy values as zero, you can do the following:

import pandas as pd
import more_itertools as mit

df = pd.DataFrame({'energy': [0, None, 0, 0, 1, 0, 0, 1, 0, None, 0, None, 1]})
df = df.fillna(value=0)

# find the indices with zero energy
indices = df.loc[df['energy'] == 0].index.values.tolist()

# extract long outages
threshold = 4  # minimum length for an outage to be considered "long"
outages_indices = [list(group) for group in mit.consecutive_groups(indices)]
long_outages_indices = [l for l in outages_indices if len(l) >= threshold]