I'm trying to merge the begin and end columns into two columns, Flag and Timestamp, with this code:
print(df_DisponibilityAlarm.shape)
df_DisponibilityAlarm = (df_DisponibilityAlarm.stack()
.rename_axis([None, 'Flag'])
.reset_index(level=1, name='Timestamp'))
print(df_DisponibilityAlarm.shape)
The result is:
begin end
0 NaN 2019-10-21 07:48:28.272688
1 NaN 2019-10-21 07:48:28.449916
2 2019-10-21 07:48:26.740378 NaN
3 2019-10-21 07:48:26.923764 NaN
4 NaN 2019-10-21 07:48:41.689466
5 2019-10-21 07:48:37.306045 NaN
6 NaN 2019-10-21 07:58:00.774449
7 2019-10-21 07:57:59.223986 NaN
8 NaN 2019-10-21 08:32:37.004455
9 2019-10-21 08:32:35.755252 NaN
(13129, 2)
(13140, 2)
Flag Timestamp
0 end 2019-10-21 07:48:28.272688
1 end 2019-10-21 07:48:28.449916
2 begin 2019-10-21 07:48:26.740378
3 begin 2019-10-21 07:48:26.923764
4 end 2019-10-21 07:48:41.689466
5 begin 2019-10-21 07:48:37.306045
6 end 2019-10-21 07:58:00.774449
7 begin 2019-10-21 07:57:59.223986
8 end 2019-10-21 08:32:37.004455
9 begin 2019-10-21 08:32:35.755252
It works! But when I look closely, I see that the number of rows increases when I use stack()... I don't understand why; could you explain? I need this in order to validate my begin/end assumption.
Answer (score: 1)
Essentially, the stack() function makes your dataset longer, but, as you can see in your example printout, the stacked output does not include NaN values.
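That NaN-dropping behavior can be seen in a minimal sketch (the column names mirror the question; the classic stack() implementation drops NaN cells by default):

```python
import numpy as np
import pandas as pd

# two rows, each with one NaN cell, mirroring the begin/end layout
df = pd.DataFrame({
    "begin": [np.nan, "2019-10-21 07:48:26.740378"],
    "end":   ["2019-10-21 07:48:28.272688", np.nan],
})

print(df.shape)                # (2, 2): four cells in total
print(df.notna().sum().sum())  # 2 non-NaN cells
# stack() keeps only the non-NaN cells, so the stacked length
# equals the non-NaN cell count, not rows * columns
print(df.stack())
```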
Consider the following test case, with a TEST_VALUE added:
begin,end
NaN,2019-10-21 07:48:28.272688
NaN,2019-10-21 07:48:28.449916
2019-10-21 07:48:26.740378,NaN
2019-10-21 07:48:26.923764,NaN
NaN,2019-10-21 07:48:41.689466
2019-10-21 07:48:37.306045,TEST_VALUE
NaN,2019-10-21 07:58:00.774449
2019-10-21 07:57:59.223986,NaN
NaN,2019-10-21 08:32:37.004455
2019-10-21 08:32:35.755252,NaN
df = pd.read_clipboard(sep=',')
print(df.shape)
print(df.stack().shape)
print(df.stack())
(10, 2)
(11,)
0 end 2019-10-21 07:48:28.272688
1 end 2019-10-21 07:48:28.449916
2 begin 2019-10-21 07:48:26.740378
3 begin 2019-10-21 07:48:26.923764
4 end 2019-10-21 07:48:41.689466
5 begin 2019-10-21 07:48:37.306045
end TEST_VALUE
6 end 2019-10-21 07:58:00.774449
7 begin 2019-10-21 07:57:59.223986
8 end 2019-10-21 08:32:37.004455
9 begin 2019-10-21 08:32:35.755252
# this part is strange
print(df.stack()[5])
begin 2019-10-21 07:48:37.306045
end TEST_VALUE
dtype: object
The odd thing is how index 5 looks compared to the rest of the output. (A likely explanation: stack() returns a Series indexed by (row, column) pairs, and row 5 is the only row with two non-NaN cells, so selecting index 5 yields two entries where every other row yields one.) But let's continue...
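A tiny sketch of that selection (hypothetical values, just to show the shape of the result):

```python
import pandas as pd

# a single row at index 5 with BOTH begin and end populated
df = pd.DataFrame({"begin": ["2019-10-21 07:48:37.306045"],
                   "end":   ["TEST_VALUE"]}, index=[5])

s = df.stack()
print(s.index)   # MultiIndex of (row, column) pairs
print(s.loc[5])  # selecting row 5 returns a Series with both entries
```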
As you can see, this adds an extra row for the TEST_VALUE, and once again any NaN values are excluded from the stacked version.
I would "normalize" the data, that is, make sure every cell contains either NaN or a date, or fill the NaN values with something else using fillna().
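A hedged sketch of the fillna() route (the filler string "missing" is an arbitrary choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"begin": [np.nan, "2019-10-21 07:48:26.740378"],
                   "end":   ["2019-10-21 07:48:28.272688", np.nan]})

# after filling, no cell is NaN, so stacking keeps every cell
filled = df.fillna("missing")
print(filled.stack().shape)  # (4,): rows * columns
```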
If you'd rather keep the data as-is, you can also process it in chunks to zero in on WHERE it isn't uniform (in my example, the TEST_VALUE):
# read df in as an iterator of chunks
df1 = pd.read_clipboard(
    sep=',',
    chunksize=3  # change this to whatever chunksize you need
)

def test_chunks(df_iterator):
    """
    Compares each original chunk's row count to its stacked row count.
    Returns: the first chunk where there is a mismatch in shapes.
    """
    for df_chunk in df_iterator:
        original = df_chunk.shape[0]
        stacked = df_chunk.stack().shape[0]
        if original != stacked:
            return df_chunk

bad_chunk = test_chunks(df1)
print(bad_chunk)
begin end
3 2019-10-21 07:48:26.923764 NaN
4 NaN 2019-10-21 07:48:41.689466
5 2019-10-21 07:48:37.306045 TEST_VALUE
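As an alternative to chunking (a sketch, not part of the original answer): if each row is expected to carry exactly one timestamp, a per-row non-NaN count flags the offending rows directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"begin": [np.nan, "2019-10-21 07:48:37.306045"],
                   "end":   ["2019-10-21 07:48:28.272688", "TEST_VALUE"]})

# rows whose non-NaN count differs from 1 would change the stacked length
bad_rows = df[df.notna().sum(axis=1) != 1]
print(bad_rows)
```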
So, while I can't be certain without access to your full data, my judgment is that this is a data-consistency issue.
I hope this helps.