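I have a pandas DataFrame with three columns: id, start_time, and end_time. I would like to transform it into a DataFrame with two columns, id and time, with one row for every time value from start_time to end_time inclusive.
E.g. from [001, 1, 3][002, 3, 4] to [001, 1][001, 2][001, 3][002, 3][002, 4].
Currently I am using a for loop and appending to the DataFrame on each iteration, but this is very slow. Is there another method I can use to save time?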
Answer 0 (score: 1)
If start_time and end_time are timedeltas, use:
import pandas as pd

df = pd.DataFrame([['001', 1, 3], ['002', 3, 4]],
                  columns=['id', 'start_time', 'end_time'])
print(df)
    id  start_time  end_time
0  001           1         3
1  002           3         4
# stack start_time and end_time into a single 'time' column
df1 = pd.melt(df, id_vars='id', value_name='time').drop('variable', axis=1)
# convert the integer seconds to timedelta
df1['time'] = pd.to_timedelta(df1['time'], unit='s')
# resample needs a time-like index
df1.set_index('time', inplace=True)
print(df1)
           id
time
00:00:01  001
00:00:03  002
00:00:03  001
00:00:04  002
# group by id and resample by one second
print(df1.groupby('id')
         .resample('1S')
         .ffill()
         .reset_index(drop=True, level=0)
         .reset_index())
      time   id
0 00:00:01  001
1 00:00:02  001
2 00:00:03  001
3 00:00:03  002
4 00:00:04  002
If start_time and end_time are datetimes, use:
df = pd.DataFrame([['001', '2016-01-01', '2016-01-03'],
                   ['002', '2016-01-03', '2016-01-04']],
                  columns=['id', 'start_time', 'end_time'])
print(df)
    id  start_time    end_time
0  001  2016-01-01  2016-01-03
1  002  2016-01-03  2016-01-04
df1 = pd.melt(df, id_vars='id', value_name='time').drop('variable', axis=1)
#convert to datetime
df1['time'] = pd.to_datetime(df1.time)
df1.set_index('time', inplace=True)
print(df1)
             id
time
2016-01-01  001
2016-01-03  002
2016-01-03  001
2016-01-04  002
# group by id and resample by one day
print(df1.groupby('id')
         .resample('1D')
         .ffill()
         .reset_index(drop=True, level=0)
         .reset_index())
        time   id
0 2016-01-01  001
1 2016-01-02  001
2 2016-01-03  001
3 2016-01-03  002
4 2016-01-04  002
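As a possible alternative (a sketch only, not part of the original answer), the same daily expansion can be written without groupby/resample by building one date range per row and exploding it. This assumes a daily frequency with both endpoints included, and a pandas version that provides DataFrame.explode (0.25 or later). The names `ranges` and `out` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([['001', '2016-01-01', '2016-01-03'],
                   ['002', '2016-01-03', '2016-01-04']],
                  columns=['id', 'start_time', 'end_time'])

# One list of daily timestamps per row, both endpoints included.
ranges = [list(pd.date_range(start, end, freq='D'))
          for start, end in zip(df['start_time'], df['end_time'])]

# explode() turns each list element into its own row, repeating the id.
out = (df[['id']]
       .assign(time=ranges)
       .explode('time')
       .reset_index(drop=True))

# explode() leaves an object column, so convert it back to datetime64.
out['time'] = pd.to_datetime(out['time'])
print(out)
```

The list comprehension still loops over rows in Python, so this is mainly a readability alternative. It avoids appending to a DataFrame inside the loop, which is the part that tends to be slow.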
Answer 1 (score: 0)
Here is my take on your question:
df.set_index('id', inplace=True)
reshaped = df.apply(lambda x: pd.Series(range(x['start time'], x['end time']+1)), axis=1).\
stack().reset_index().drop('level_1', axis=1)
reshaped.columns = ['id', 'time']
reshaped
Input:
import pandas as pd
from io import StringIO
data = StringIO("""id,start time,end time
001, 1, 3
002, 3, 4""")
df = pd.read_csv(data, dtype={'id':'object'})
df.set_index('id', inplace=True)
print("In\n", df)
reshaped = df.apply(lambda x: pd.Series(range(x['start time'], x['end time']+1)), axis=1).\
stack().reset_index().drop('level_1', axis=1)
reshaped.columns = ['id', 'time']
print("Out\n", reshaped)
Output:
In
      start time  end time
id
001           1         3
002           3         4
Out
     id  time
0  001     1
1  001     2
2  001     3
3  002     3
4  002     4
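For integer start and end values, the row-wise apply can also be replaced by a fully vectorized expansion with NumPy. The following is a minimal sketch under stated assumptions, not the answerer's method: it assumes end_time >= start_time in every row, uses the underscore column names from the question rather than the spaced names above, and the variable names (`lengths`, `offsets`, `out`) are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['001', 1, 3], ['002', 3, 4]],
                  columns=['id', 'start_time', 'end_time'])

# Number of output rows per input row (inclusive range).
lengths = (df['end_time'] - df['start_time'] + 1).to_numpy()

# Repeat each id and each start value once per output row.
ids = np.repeat(df['id'].to_numpy(), lengths)
starts = np.repeat(df['start_time'].to_numpy(), lengths)

# Position within each group: 0, 1, 2, ... restarting for every input row.
group_starts = np.repeat(lengths.cumsum() - lengths, lengths)
offsets = np.arange(lengths.sum()) - group_starts

out = pd.DataFrame({'id': ids, 'time': starts + offsets})
print(out)
```

Because no Python-level loop runs over the rows, this should scale better than appending in a loop or calling apply with axis=1, although the actual speed-up depends on the data.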