Why does the number of rows increase when I use stack()?

Asked: 2020-03-17 00:05:03

Tags: python pandas datetime timestamp

I am trying to merge the begin and end columns into two columns, Flag and Timestamp, with this code:

print(df_DisponibilityAlarm.shape)

df_DisponibilityAlarm = (df_DisponibilityAlarm.stack()
 .rename_axis([None, 'Flag'])
 .reset_index(level=1, name='Timestamp'))

print(df_DisponibilityAlarm.shape)

The result is:

                         begin                          end
0                          NaN  2019-10-21  07:48:28.272688
1                          NaN  2019-10-21  07:48:28.449916
2  2019-10-21  07:48:26.740378                          NaN
3  2019-10-21  07:48:26.923764                          NaN
4                          NaN  2019-10-21  07:48:41.689466
5  2019-10-21  07:48:37.306045                          NaN
6                          NaN  2019-10-21  07:58:00.774449
7  2019-10-21  07:57:59.223986                          NaN
8                          NaN  2019-10-21  08:32:37.004455
9  2019-10-21  08:32:35.755252                          NaN

(13129, 2)
(13140, 2)

    Flag                    Timestamp
0    end  2019-10-21  07:48:28.272688
1    end  2019-10-21  07:48:28.449916
2  begin  2019-10-21  07:48:26.740378
3  begin  2019-10-21  07:48:26.923764
4    end  2019-10-21  07:48:41.689466
5  begin  2019-10-21  07:48:37.306045
6    end  2019-10-21  07:58:00.774449
7  begin  2019-10-21  07:57:59.223986
8    end  2019-10-21  08:32:37.004455
9  begin  2019-10-21  08:32:35.755252

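It works! But when I looked closely, I saw that the number of rows increased when I used stack(). I don't understand why. Could you explain it? I need this to validate my initial hypothesis.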

1 answer:

Answer 0 (score: 1)

Edited

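In essence, stack() reshapes your data from wide to long: every non-missing cell becomes its own row. As you can see in your example printout, cells containing NaN are not included after stacking.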
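To make the row arithmetic concrete, here is a minimal sketch under stated assumptions: the three-row frame is a hypothetical miniature of the question's data, and the trailing dropna() is my addition so the result is the same on pandas versions where stack() keeps NaN. It shows that the stacked length equals the number of non-missing cells, not the number of original rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's frame (values are illustrative).
df = pd.DataFrame({
    "begin": [np.nan, "2019-10-21 07:48:26.740378", "2019-10-21 07:48:37.306045"],
    "end": ["2019-10-21 07:48:28.272688", np.nan, "2019-10-21 07:48:41.689466"],
})

# Number of non-missing cells = number of entries expected after stacking.
non_missing_cells = int(df.count().sum())

# dropna() is a no-op where stack() already drops NaN; it is included as an
# assumption so the sketch behaves the same on versions that keep NaN.
stacked = df.stack().dropna()

print(df.shape[0], non_missing_cells, len(stacked))
```

Here the third row holds both a begin and an end value, so 3 rows become 4 stacked entries. By the same reasoning, the jump from 13129 to 13140 in the question would be consistent with a net surplus of 11 non-missing cells, for example 11 rows that have both values, assuming no rows are entirely empty.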


Consider the following test case, with TEST_VALUE added:

begin,end
NaN,2019-10-21  07:48:28.272688
NaN,2019-10-21  07:48:28.449916
2019-10-21  07:48:26.740378,NaN
2019-10-21  07:48:26.923764,NaN
NaN,2019-10-21  07:48:41.689466
2019-10-21  07:48:37.306045,TEST_VALUE
NaN,2019-10-21  07:58:00.774449
2019-10-21  07:57:59.223986,NaN
NaN,2019-10-21  08:32:37.004455
2019-10-21  08:32:35.755252,NaN

import pandas as pd

# copy the CSV text above to the clipboard before running this
df = pd.read_clipboard(sep=',')

print(df.shape)
print(df.stack().shape)
print(df.stack())

(10, 2)
(11,)
0  end      2019-10-21  07:48:28.272688
1  end      2019-10-21  07:48:28.449916
2  begin    2019-10-21  07:48:26.740378
3  begin    2019-10-21  07:48:26.923764
4  end      2019-10-21  07:48:41.689466
5  begin    2019-10-21  07:48:37.306045
   end                       TEST_VALUE
6  end      2019-10-21  07:58:00.774449
7  begin    2019-10-21  07:57:59.223986
8  end      2019-10-21  08:32:37.004455
9  begin    2019-10-21  08:32:35.755252

# this part is strange
print(df.stack()[5])
begin    2019-10-21  07:48:37.306045
end                       TEST_VALUE
dtype: object

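Oddly, index 5 looks different from the other indices. I don't have an answer for why it appears as a MultiIndex entry while the rest of the indices in this Series do not, but let's move on...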
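A likely explanation for the display at index 5, offered as a sketch rather than as part of the original answer: the whole stacked Series carries a two-level MultiIndex (row label, column name), and the printout simply leaves the outer label blank when it repeats on consecutive lines. The two-row frame below is hypothetical, and the dropna() call is an assumption added so the result matches across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: row 1 has values in both columns.
df = pd.DataFrame({
    "begin": [np.nan, "2019-10-21 07:48:37.306045"],
    "end": ["2019-10-21 07:48:28.272688", "TEST_VALUE"],
})

# dropna() is a no-op where stack() already drops NaN (assumption for portability).
stacked = df.stack().dropna()

# Every entry is keyed by (row label, column name), not only the "odd" one.
index_type = type(stacked.index).__name__
index_keys = sorted(stacked.index.tolist())

print(index_type)
print(index_keys)
print(stacked.loc[1])  # selecting the outer label returns both inner entries
```

So row 1 is not a different kind of index. It is just the only outer label that owns two inner entries, which is why it prints across two lines.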

As you can see, TEST_VALUE adds an extra row, and once again no NaN values are included in the stacked version.

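I would "normalize" the data, that is, make sure each cell contains either NaN or a date, or fill the NaN values with some other value using fillna().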
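Both suggestions can be sketched roughly as follows. The frame, the "missing" placeholder, and the choice of pd.to_datetime with errors="coerce" as the normalization step are my assumptions for illustration, not something the original data or answer prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one stray non-date value.
df = pd.DataFrame({
    "begin": [np.nan, "2019-10-21 07:48:37.306045"],
    "end": ["2019-10-21 07:48:28.272688", "TEST_VALUE"],
})

# Option 1: coerce each column to datetimes; unparseable values become NaT.
normalized = df.apply(pd.to_datetime, errors="coerce")

# Option 2: replace NaN with a placeholder so that no cell is missing.
filled = df.fillna("missing")

print(normalized)
print(filled.stack().shape)
```

After option 1 each row holds exactly one timestamp, so stacking yields one entry per row. After option 2 every cell survives stacking, so the length is rows times columns.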

If you prefer to keep your data as it is, you can also work through it in chunks to zero in on WHERE the data is not uniform (TEST_VALUE in my example):

# read df in as an iterator with chunks

df1 = pd.read_clipboard(
                    sep=',', 
                    chunksize=3  # change this to whatever chunksize you need
)

def test_chunks(df_iterator):
    """
    Function that compares original df chunk size shape to stacked chunksize shape
    Returns: original chunk where there is a mismatch in shapes
    """
    for df_chunk in df_iterator:
        original = df_chunk.shape[0]
        stacked = df_chunk.stack().shape[0]
        if original != stacked:
            return df_chunk

bad_chunk = test_chunks(df1)
print(bad_chunk)

                         begin                          end
3  2019-10-21  07:48:26.923764                          NaN
4                          NaN  2019-10-21  07:48:41.689466
5  2019-10-21  07:48:37.306045                   TEST_VALUE
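As an alternative to scanning chunks, the offending rows can also be located directly by counting non-missing values per row. This is a minimal sketch on a hypothetical frame; the names values_per_row and inconsistent are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 has two values, row 3 has none.
df = pd.DataFrame({
    "begin": [np.nan, "2019-10-21 07:48:26.740378", "2019-10-21 07:48:37.306045", np.nan],
    "end": ["2019-10-21 07:48:28.272688", np.nan, "TEST_VALUE", np.nan],
})

# A consistent row has exactly one non-missing value (either begin or end).
values_per_row = df.notna().sum(axis=1)

# Rows with two values add entries when stacked; rows with none remove one.
inconsistent = df[values_per_row != 1]

print(inconsistent)
```

Running the same mask on the full data should list every row that makes the stacked length differ from the original row count.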

So, although I cannot be certain without access to your full data, my judgment is that this is a data consistency issue.

I hope this helps.