I have a DataFrame:
site1 time1 site2 time2 site3 time3 site4 time4 site5 time5 ... time6 site7 time7 site8 time8 site9 time9 site10 time10 target
session_id
21669 56 2013-01-12 08:05:57 55.0 2013-01-12 08:05:57 NaN NaT NaN NaT NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
54843 56 2013-01-12 08:37:23 55.0 2013-01-12 08:37:23 56.0 2013-01-12 09:07:07 55.0 2013-01-12 09:07:09 NaN NaT ... NaT NaN NaT NaN NaT NaN NaT NaN NaT 0
77292 946 2013-01-12 08:50:13 946.0 2013-01-12 08:50:14 951.0 2013-01-12 08:50:15 946.0 2013-01-12 08:50:15 946.0 2013-01-12 08:50:16 ... 2013-01-12 08:50:16 948.0 2013-01-12 08:50:16 784.0 2013-01-12 08:50:16 949.0 2013-01-12 08:50:17 946.0 2013-01-12 08:50:17 0
For each row, I need to compute the difference between the last non-NaN time and the first time.
Desired output (converted to seconds):
session_id diff
21669 0
54843 2013-01-12 09:07:09 - 2013-01-12 08:37:23 55.0
77292 4
I could compute the difference for each pair of columns and then combine the results:
df['diff1'] = df['time1'] - df['time2']
...
But is there a faster way to do this?
Answer 0 (score: 2)
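Drop target, build a pd.MultiIndex from the column names, then groupby 'session_id' and use 'first' and 'last' to get the first and last non-null values. pipe is a convenient way to pass the result to a function that does the subtraction.
d = df.drop(columns='target')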
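# Split names such as 'time3' into ('time', 3) and use them as a two-level column index
a = d.columns.str.extract(r'([a-z]+)(\d+)', expand=True).values.T
mux = pd.MultiIndex.from_arrays([a[0], a[1].astype(int)])
d.columns = mux
# Parse every time column as datetime; values that cannot be parsed become NaT
for (c0, c1), col in d.items():
    if c0 == 'time':
        d[(c0, c1)] = pd.to_datetime(col, errors='coerce')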
f = lambda d: d['last'].sub(d['first']).dt.total_seconds()
d.time.stack().groupby('session_id').agg(['last', 'first']).pipe(f)
session_id
21669 0.0
54843 1786.0
77292 4.0
dtype: float64
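The stack-and-group idea can also be tried without the MultiIndex step. Below is a minimal sketch, assuming a made-up sample with only three site/time pairs (the question has ten) and session_id as the index name; the variable names times, long, and bounds are illustrative only.

```python
import pandas as pd

# Hypothetical sample: three sessions, three site/time pairs
df = pd.DataFrame(
    {
        "site1": [56, 56, 946],
        "time1": ["2013-01-12 08:05:57", "2013-01-12 08:37:23", "2013-01-12 08:50:13"],
        "site2": [55.0, 55.0, 946.0],
        "time2": ["2013-01-12 08:05:57", "2013-01-12 08:37:23", "2013-01-12 08:50:14"],
        "site3": [None, 55.0, 946.0],
        "time3": [None, "2013-01-12 09:07:09", "2013-01-12 08:50:17"],
        "target": [0, 0, 0],
    },
    index=pd.Index([21669, 54843, 77292], name="session_id"),
)

# Keep only the time columns and parse them; missing values become NaT
times = df.filter(like="time").apply(pd.to_datetime, errors="coerce")

# One long Series indexed by (session_id, column name), in column order per session
long = times.stack()

# 'first' and 'last' return the first and last non-null value in each group
bounds = long.groupby(level="session_id").agg(["first", "last"])
diff_seconds = (bounds["last"] - bounds["first"]).dt.total_seconds()
print(diff_seconds)
```

Because 'first' and 'last' skip nulls, the result should not depend on whether stack() drops the NaT entries.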
Answer 1 (score: 2)
Use:
a = df.filter(like='time').notnull().iloc[:, ::-1].idxmax(1)
print (a)
0 time2
1 time4
2 time5
dtype: object
# Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0
df['diff'] = (pd.Series(df.lookup(df.index, a), index=df.index)
                .sub(df['time1'])
                .dt.total_seconds())
print (df['diff'])
0 0.0
1 1786.0
2 4.0
Name: diff, dtype: float64
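On pandas 2.0 and later, where DataFrame.lookup is no longer available, the same row-wise pick can be approximated with positional NumPy indexing. This is only a sketch under assumptions: the sample data is made up (three time columns instead of ten), and last_col, col_pos, and last_val are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing only the time columns
times = pd.DataFrame(
    {
        "time1": ["2013-01-12 08:05:57", "2013-01-12 08:37:23", "2013-01-12 08:50:13"],
        "time2": ["2013-01-12 08:05:57", "2013-01-12 08:37:23", "2013-01-12 08:50:14"],
        "time3": [None, "2013-01-12 09:07:09", "2013-01-12 08:50:17"],
    },
    index=pd.Index([21669, 54843, 77292], name="session_id"),
).apply(pd.to_datetime)

# Reverse the column order so idxmax returns the label of the LAST non-null column
last_col = times.notna().iloc[:, ::-1].idxmax(axis=1)

# Turn the labels into column positions and pick one value per row
col_pos = times.columns.get_indexer(last_col)
last_val = pd.Series(
    times.to_numpy()[np.arange(len(times)), col_pos],
    index=times.index,
)

diff = (last_val - times["time1"]).dt.total_seconds()
print(diff)
```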
NumPy alternative:
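import numpy as np

A = df.filter(like='time')
# Position of the last non-null column in each row
b = len(A.columns) - A.notnull().values[:, ::-1].argmax(1) - 1
# Reuse df.index so the subtraction aligns row by row
df['diff'] = pd.Series(A.values[np.arange(len(A)), b], index=df.index).sub(df['time1']).dt.total_seconds()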
print (df['diff'])
0 0.0
1 1786.0
2 4.0
Name: diff, dtype: float64
A more general version of Ken Wei's solution, selecting the first and last columns with iloc:
df1 = df.filter(like='time')
df['diff'] = df1.ffill(axis=1).iloc[:, -1].sub(df1.iloc[:, 0]).dt.total_seconds()
print (df['diff'])
0 0.0
1 1786.0
2 4.0
Name: diff, dtype: float64
Answer 2 (score: 1)
Use .ffill() on a DataFrame containing only the time columns:
df['diff1'] = df.filter(like='time').ffill(axis = 1).time10 - df.time1
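The subtraction above yields a Timedelta column rather than seconds. For a quick check of the forward-fill approach end to end, here is a small self-contained sketch. The sample data is made up (three site/time pairs instead of ten), and the last column is selected by position rather than by the name time10.

```python
import pandas as pd

# Hypothetical sample: three sessions, three site/time pairs
df = pd.DataFrame(
    {
        "site1": [56, 56, 946],
        "time1": ["2013-01-12 08:05:57", "2013-01-12 08:37:23", "2013-01-12 08:50:13"],
        "site2": [55.0, 55.0, 946.0],
        "time2": ["2013-01-12 08:05:57", "2013-01-12 08:37:23", "2013-01-12 08:50:14"],
        "site3": [None, 55.0, 946.0],
        "time3": [None, "2013-01-12 09:07:09", "2013-01-12 08:50:17"],
        "target": [0, 0, 0],
    },
    index=pd.Index([21669, 54843, 77292], name="session_id"),
)

# Only the time columns, parsed as datetimes
times = df.filter(like="time").apply(pd.to_datetime)

# Forward-fill along each row so the last column holds the last non-null time
filled = times.ffill(axis=1)

# Last column minus first column, converted to seconds
df["diff"] = (filled.iloc[:, -1] - times.iloc[:, 0]).dt.total_seconds()
print(df["diff"])
```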