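I am trying to split a single row of a DataFrame into two rows, based on its start and end columns. A row should only be split when a condition holds.
I have a DataFrame like the following: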
symbol,start,end,size
ABC,2015-08-27 18:00:00,2015-08-28 05:00:00,12
ABC,2015-11-20 02:00:00,2015-11-20 06:00:00,5
ABC,2016-01-22 03:00:00,2016-01-22 06:00:00,4
PQR,2016-02-12 02:00:00,2016-02-12 06:00:00,5
PQR,2016-02-12 22:00:00,2016-02-13 03:00:00,6
PQR,2016-02-12 02:00:00,2016-02-12 07:00:00,6
Condition:
Example: consider this row:
PQR,2016-02-12 22:00:00,2016-02-13 03:00:00,6
In the row above, start falls on the 12th and end falls on the 13th, so the row needs to be split into two rows, as follows:
PQR,2016-02-12 22:00:00,2016-02-12 23:00:00,2
PQR,2016-02-13 00:00:00,2016-02-13 03:00:00,4
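If a row spans three days, for example starting on the 12th and ending on the 14th, it needs to be split into three rows.
The expected output is: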
symbol,start,end,size
ABC,2015-08-27 18:00:00,2015-08-27 23:00:00,6
ABC,2015-08-28 00:00:00,2015-08-28 05:00:00,6
ABC,2015-11-20 02:00:00,2015-11-20 06:00:00,5
ABC,2016-01-22 03:00:00,2016-01-22 06:00:00,4
PQR,2016-02-12 02:00:00,2016-02-12 06:00:00,5
PQR,2016-02-12 22:00:00,2016-02-12 23:00:00,2
PQR,2016-02-13 00:00:00,2016-02-13 03:00:00,4
PQR,2016-02-12 02:00:00,2016-02-12 07:00:00,6
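To reproduce the problem, the sample above can be loaded into a DataFrame with the `start` and `end` columns parsed as datetimes, which the approaches below rely on. This is only a minimal sketch: the name `CSV_TEXT` is made up for illustration, and reading from an in-memory string stands in for whatever file the real data comes from.

```python
import io

import pandas as pd

# Sample rows copied from the question; CSV_TEXT is a hypothetical name.
CSV_TEXT = """symbol,start,end,size
ABC,2015-08-27 18:00:00,2015-08-28 05:00:00,12
ABC,2015-11-20 02:00:00,2015-11-20 06:00:00,5
ABC,2016-01-22 03:00:00,2016-01-22 06:00:00,4
PQR,2016-02-12 02:00:00,2016-02-12 06:00:00,5
PQR,2016-02-12 22:00:00,2016-02-13 03:00:00,6
PQR,2016-02-12 02:00:00,2016-02-12 07:00:00,6
"""

# parse_dates converts the two timestamp columns to datetime64 on load,
# so .dt accessors and .date() comparisons work on them.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["start", "end"])

# Rows whose end falls on a later calendar date than their start.
needs_split = df["end"].dt.normalize() > df["start"].dt.normalize()
print(df[needs_split])
```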
Answer 0: (score: 1)
Option 1
Iterate over the rows and build up a new DataFrame row by row.
import pandas as pd
import datetime

# assumes df['start'] and df['end'] are already datetime columns
rows = []
for (_, r) in df.iterrows():
    while r['start'].date() < r['end'].date():
        # create new row covering the rest of the start day
        newR = r.copy()
        newR['end'] = newR['start'].replace(hour=23)
        newSize = 24 - newR['start'].hour
        newR['size'] = newSize
        rows.append(newR)
        # update row to process: continue from midnight of the next day
        r['start'] = r['start'] + datetime.timedelta(days=1)
        r['start'] = r['start'].replace(hour=0)
        r['size'] = r['size'] - newSize
    rows.append(r)
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list
df2 = pd.DataFrame(rows, columns=df.columns)
df2.reset_index(drop=True, inplace=True)
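Option 2
Operate on the whole DataFrame at once: use a mask to select the rows of the original DataFrame that have to be split across days, and call the function recursively on the remainders.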
import pandas as pd
import numpy as np
import datetime

def splitMultiDayRows(df):
    # compare calendar dates rather than day-of-month, so month boundaries work
    mask = df['end'].dt.normalize() > df['start'].dt.normalize()
    if np.any(mask):
        # copy the rows to split so the remainders can be edited independently
        df_new = df.loc[mask].copy()
        newSizes = 24 - df.loc[mask, 'start'].dt.hour
        # first part: ends at hour 23 of the start day
        df.loc[mask, 'end'] = df.loc[mask, 'start'].apply(
            lambda x: x.replace(hour=23))
        df.loc[mask, 'size'] = newSizes
        # remainder: starts at midnight of the next day
        df_new['start'] = df_new['start'] + datetime.timedelta(days=1)
        df_new['start'] = df_new['start'].apply(lambda x: x.replace(hour=0))
        df_new['size'] = df_new['size'] - newSizes
        # the remainder may still span several days, so recurse on it
        return pd.concat([df, splitMultiDayRows(df_new)])
    else:
        return df
Use it with the following call:
splitMultiDayRows(df.copy()).\
    sort_values(['symbol', 'start']).\
    reset_index(drop=True)
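The splitting rule that both options apply can be sketched for a single interval in plain Python, which makes the size arithmetic easier to follow. The function name `split_interval` is made up for illustration, and the sketch assumes whole-hour timestamps and that `size` counts the hours from the start hour through the end hour inclusive, as the sample data suggests.

```python
from datetime import datetime, timedelta


def split_interval(start, end, size):
    """Split one (start, end, size) interval at day boundaries.

    Assumption: timestamps are on whole hours and size counts the hours
    from start.hour through end.hour inclusive.
    """
    parts = []
    while start.date() < end.date():
        # Hours left in the current day, including the start hour itself.
        first_size = 24 - start.hour
        parts.append((start, start.replace(hour=23), first_size))
        # Continue from midnight of the next day with the remaining size.
        start = (start + timedelta(days=1)).replace(hour=0)
        size -= first_size
    parts.append((start, end, size))
    return parts


# The example row from the question: 22:00 on the 12th to 03:00 on the 13th.
parts = split_interval(datetime(2016, 2, 12, 22), datetime(2016, 2, 13, 3), 6)
for part in parts:
    print(part)

# A hypothetical three-day interval, to show that the loop repeats per day.
three_days = split_interval(datetime(2016, 2, 12, 22), datetime(2016, 2, 14, 3), 30)
```

Each pass of the loop emits the part that ends at hour 23 of the current day and carries the rest forward, so an interval touching three dates yields three parts.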
Answer 1: (score: 1)
This answer avoids duplication and does not copy rows unnecessarily, so it saves time and space.
import pandas as pd
import datetime

df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# only rows whose end falls on a later calendar date need to be split
mask_to_change = df['end'].dt.normalize() > df['start'].dt.normalize()
rows = []
for (_, r) in df[mask_to_change].iterrows():
    while r['start'].date() < r['end'].date():
        # create new row covering the rest of the start day
        newR = r.copy()
        newR['end'] = newR['start'].replace(hour=23)
        newSize = 24 - newR['start'].hour
        newR['size'] = newSize
        rows.append(newR)
        # update row to process: continue from midnight of the next day
        r['start'] = r['start'] + datetime.timedelta(days=1)
        r['start'] = r['start'].replace(hour=0)
        r['size'] = r['size'] - newSize
    rows.append(r)
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list
df2 = pd.DataFrame(rows, columns=df.columns)
df = pd.concat([df[~mask_to_change], df2])
df.sort_values(['symbol', 'start'], inplace=True)
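One caveat when building the mask of rows to split: comparing day-of-month numbers misses rows that cross a month boundary, while comparing calendar dates does not. The sketch below uses a hypothetical row that is not part of the question's data to show the difference.

```python
import pandas as pd

# Hypothetical row crossing a month boundary (not from the question's data).
demo = pd.DataFrame({
    "start": pd.to_datetime(["2016-01-31 22:00:00"]),
    "end": pd.to_datetime(["2016-02-01 03:00:00"]),
})

# Day-of-month comparison: 1 > 31 is False, so the row would not be split.
by_day = demo["end"].dt.day > demo["start"].dt.day

# Calendar-date comparison: 2016-02-01 is later than 2016-01-31.
by_date = demo["end"].dt.normalize() > demo["start"].dt.normalize()

print(by_day.tolist(), by_date.tolist())
```

`Series.dt.normalize()` sets the time component to midnight, so the comparison only looks at the date.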