Question

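I have a dataset containing half-hourly energy consumption. I am trying to get a list of the indices of rows where there was no energy consumption for a long period. In other words, I want lists of indices of consecutive rows whose value in a particular column is 0. I use the code below, which seems to work for a while, but then it starts adding lists of indices whose values are not 0.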

import more_itertools as mit

indices = df.loc[df[df.columns[2]] == df[df.columns[2]].isnull()].index.values.tolist()
outages_indices = [list(group) for group in mit.consecutive_groups(indices)]
long_outages_indices = []
for i in outages_indices:
    if len(i) >= 8:
        long_outages_indices.append(i)

For example, at row 849246 the value is indeed 0, but at row 1543677 the value is 0.105 and it is still part of the list.

The first few rows of the DataFrame:

LCLid            tstp                           energy(kWh/hh)
MAC000002        2012-10-12 00:30:00.0000000    0.0
MAC000002        2012-10-12 01:00:00.0000000    0.0
MAC000002        2012-10-12 01:30:00.0000000    0.0
MAC000002        2012-10-12 02:00:00.0000000    0.0
MAC000002        2012-10-12 02:30:00.0000000    0.0

Desired output (this is what I already get, but it is incorrect):

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ...],
 [861958, 861959, 861960, 861961 ...],
 [862015, 862016, 862017, 862018, ...], ...]

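Edit: Solved. When I concatenated multiple CSV files into one Pandas DataFrame, the index numbering restarted from 0 each time a new file was appended. I reset the index, which solved my problem.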
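For anyone hitting the same problem, here is a minimal sketch of what the edit describes. The two small frames are hypothetical stand-ins for separately loaded CSV files, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-file frames, standing in for separately loaded CSVs.
part_a = pd.DataFrame({'energy': [0.0, 0.0, 0.3]})
part_b = pd.DataFrame({'energy': [0.105, 0.0, 0.0]})

# Plain concatenation keeps each file's own labels, so the labels repeat.
stacked = pd.concat([part_a, part_b])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows 0..n-1 while concatenating.
combined = pd.concat([part_a, part_b], ignore_index=True)
print(combined.index.tolist())

# Resetting the index afterwards is an equivalent fix.
reset = stacked.reset_index(drop=True)
```

With repeated labels, a lookup by label can return rows from several files, which would explain a non-zero row showing up under an index that was collected for a zero row.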

Answer 1

You want to use cumsum together with groupby:

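import pandas as pd

df = pd.DataFrame({'energy': [1, 0, 0, 0, 1, 1, 0, 0, 0]})

# mark the non-zero rows
s = df.energy.ne(0)

# group by (is non-zero, running count of non-zero rows);
# every row in one run of zeros shares the same running count
new_df = df.groupby([s, s.cumsum()]).apply(lambda x: list(x.index))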

which gives you

energy  energy
False   1         [1, 2, 3]
        3         [6, 7, 8]
True    1               [0]
        2               [4]
        3               [5]
dtype: object

The indices of interest are those whose level-0 index value is False. That is,

new_df.loc[False]

gives you

energy
1    [1, 2, 3]
3    [6, 7, 8]
dtype: object
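The result above contains every run of zeros, whatever its length. A rough sketch of keeping only the long runs, using the same cumsum idea, might look like this. The sample data and the `threshold` value are assumptions chosen for illustration, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'energy': [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]})
threshold = 4  # assumed minimum run length, for illustration only

# True where the row is non-zero; the cumulative sum labels each run of zeros.
nonzero = df['energy'].ne(0)
run_id = nonzero.cumsum()

# Keep only the zero rows and group their index labels by run label.
zero_index = df.index[~nonzero].to_series()
runs = [group.tolist() for _, group in zero_index.groupby(run_id[~nonzero])]

# Drop the runs that are shorter than the threshold.
long_runs = [run for run in runs if len(run) >= threshold]
print(long_runs)
```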

Answer 2

Your solution is close, but I think there is a bug in the condition used to extract the zero-energy indices. You have:

. . .
indices = df.loc[df[df.columns[2]] == df[df.columns[2]].isnull()].index.values.tolist()
. . .

This is a strange way to find the indices of zero-energy rows.

The following works for me:
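To illustrate why the condition is odd, here is a small sketch on a made-up three-value series. As far as I can tell, comparing the column with its own `isnull()` mask selects zeros only because `0.0 == False` evaluates to True, while a missing value compared with `True` does not match.

```python
import numpy as np
import pandas as pd

# Made-up values: a zero, a non-zero reading, and a missing reading.
energy = pd.Series([0.0, 0.105, np.nan])

# isnull() is True only for the missing value.
null_mask = energy.isnull()

# The original condition: compare each value with its own null flag.
odd_mask = energy == null_mask

# The direct condition states the intent explicitly.
direct_mask = energy == 0

print(odd_mask.tolist())
print(direct_mask.tolist())
```

Both masks pick the same rows here, so the direct comparison is simply easier to read.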

import pandas as pd
import more_itertools as mit

df = pd.DataFrame({'energy': [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]})

# find the indices with zero energy
indices = df.loc[df['energy'] == 0].index.values.tolist()

# extract long outages
threshold = 4  # minimum length for an outage to be considered "long"
outages_indices = [list(group) for group in mit.consecutive_groups(indices)]
long_outages_indices = [l for l in outages_indices if len(l) >= threshold]

If you also want to count None energy values as zero, you can do the following:

import pandas as pd
import more_itertools as mit

df = pd.DataFrame({'energy': [0, None, 0, 0, 1, 0, 0, 1, 0, None, 0, None, 1]})
df = df.fillna(value=0)

# find the indices with zero energy
indices = df.loc[df['energy'] == 0].index.values.tolist()

# extract long outages
threshold = 4  # minimum length for an outage to be considered "long"
outages_indices = [list(group) for group in mit.consecutive_groups(indices)]
long_outages_indices = [l for l in outages_indices if len(l) >= threshold]

Getting the indices of rows with consecutive 0 values from a large Pandas DataFrame
