I have a dataset with 23 columns and 4,044 rows that looks like this:
+-----+-----+---------+---------+---------+--------+
| _id | _ts | metric1 | metric2 | metric3 | etc... |
+-----+-----+---------+---------+---------+--------+
| 1   | 300 | .01     | 10      | 1       |        |
| 1   | 600 | .02     | 25      | 1       |        |
| 1   | 900 | .07     | 47      | 1       |        |
+-----+-----+---------+---------+---------+--------+
I want to pivot the data in such a way that each _ts + metric combination becomes its own column, for regression-modelling purposes, e.g. 300_metric1, 600_metric1, and so on.
Now, if I pass the dataframe through this function:
def build_timeseries_features(df):
    # Cast timestamps to strings so they can be joined into the new column names.
    df['_ts'] = df['_ts'].astype(str)
    # Reshape so every (_ts, column) pair becomes its own "<ts>_<column>" column.
    df = df.set_index('_ts', append=True).stack().unstack(0).T
    df.columns = df.columns.map('_'.join)
    # Pick out rows 1-25 by position and lay them side by side, one at a time.
    concat = pd.concat([df.iloc[[x]].dropna(1).reset_index(drop=True) for x in range(1, 26)], axis=1)
    df = pd.concat([concat, df.iloc[[4032]].dropna(1).reset_index(drop=True)], axis=1)  # gets the 14th day data by index
    return df
I get the following output, which is exactly what I want:
+-----+-------------+-------------+-------------+--------+
| _id | 300_metric1 | 600_metric1 | 900_metric1 | etc... |
+-----+-------------+-------------+-------------+--------+
| 1   | .01         | .02         | .07         |        |
+-----+-------------+-------------+-------------+--------+
The problem is that this is very slow (profiling shows it takes 43.8 seconds), and I need to be able to run it on a dataset of roughly 10,000 ids, so ~40,000 rows... at that rate it would take about 5 days to finish.
Any ideas on how to speed this up?
Answer 0 (score: 1)
Consider using pivot_table to convert the dataframe from long format to wide format. The one nuance you need is selecting the first 24 distinct values and the last value, which you can handle with series operations. The example below uses just the first two values; change it to fit your needs (a sketch adapting it to the first 24 plus the last value follows the output).
import numpy as np
import pandas as pd
# REPRODUCIBLE EXAMPLE
df = pd.DataFrame({'_id': list(range(1, 11)) * 5,
                   '_ts': [300 for i in range(10)] + [600 for i in range(10)] +
                          [900 for i in range(10)] + [1200 for i in range(10)] +
                          [1500 for i in range(10)],
                   'metric1': np.random.randn(50),
                   'metric2': np.random.randn(50),
                   'metric3': np.random.randn(50)})
# FIRST 2 AND LAST VALUES (SORTED IN _ts ORDER)
first2vals = pd.Series(df['_ts'].unique()).sort_values().tolist()[:2]
lastval = pd.Series(df['_ts'].unique()).sort_values().tolist()[-1]
# FILTER DATA FRAME BY ABOVE LISTS
df = df[df['_ts'].isin(first2vals + [lastval])]
# PIVOT DATA FRAME
pvtdf = df.pivot_table(index="_id", columns=['_ts'],
                       values=['metric1', 'metric2', 'metric3']).reset_index()
# EXTRACT NEW COLUMNS FROM HIERARCHICAL INDEX
newcols = [str(i[1])+'_'+str(i[0]) for i in pvtdf.columns[1:].values]
pvtdf.columns = pvtdf.columns.get_level_values(0)
pvtdf.columns = ['id'] + newcols
Output
print(pvtdf.head())
#    id  300_metric1  600_metric1  1500_metric1  300_metric2  600_metric2  1500_metric2  300_metric3  600_metric3  1500_metric3
# 0   1    -1.158317     1.677042     -0.763932     0.673375    -1.345052     -0.754341    -0.023793    -1.212369      1.566882
# 1   2     1.699644     0.700463      1.351290    -0.672567    -0.941611      0.739071     1.270882     0.079738     -1.272970
# 2   3     0.414411    -1.110571      0.744850    -0.822367     1.897526     -0.344387    -0.382097     0.631639      0.515618
# 3   4     0.744617     0.708938     -0.851571    -1.312690     1.817234     -1.084037    -1.253749    -1.554973     -0.162376
# 4   5     1.233120     0.569504      0.560808     0.437648     0.293689      0.675582     1.396155     0.210394     -0.504569
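As a minimal sketch of the "first 24 distinct values plus the last one" selection mentioned above, applied to the same pivot (the cut-off of 24 and the metric column names come from the question; on the small example frame above the [:24] slice simply keeps all five timestamps):

# Assumed adaptation: keep the first 24 distinct _ts values plus the last one.
ts_sorted = sorted(df['_ts'].unique())
keep_ts = ts_sorted[:24] + [ts_sorted[-1]]

filtered = df[df['_ts'].isin(keep_ts)]
pvtdf = filtered.pivot_table(index="_id", columns=['_ts'],
                             values=['metric1', 'metric2', 'metric3']).reset_index()

# Same column flattening as above, producing "<ts>_<metric>" names.
newcols = [str(ts) + '_' + str(metric) for metric, ts in pvtdf.columns[1:].values]
pvtdf.columns = ['_id'] + newcols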
Answer 1 (score: 0)
I came up with a hacky way to do this myself. It runs in ~20 ms:
def time_series_columns(df):
    # Collect each column's values as a plain Python list.
    data_values = []
    for x in df.columns:
        data_values.append(df[x].values.tolist())
    # Build the "<ts>_<metric>" column names for the 25 timestamps (0-7200, step 300).
    columns = []
    for metric in df.columns.values:
        for ts in np.arange(0, 7500, 300):
            columns.append("{}_{}".format(ts, metric))
    # Flatten the first 25 values of every column into a single feature row.
    data = [[item for sublist in [listy[:25] for listy in data_values] for item in sublist]]
    new_df = pd.DataFrame(data, columns=columns)
    return new_df
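A minimal usage sketch under the assumptions this function makes but does not show: the frame holds a single _id, is already sorted by _ts, has the id/timestamp columns dropped, and has at least 25 rows per metric (the small reproducible frame in the other answer only has 5 rows per id, so its column count would not line up):

# Hypothetical usage: build the feature row for one id from the long-format data.
wide = (df[df['_id'] == 1]
        .sort_values('_ts')
        .drop(columns=['_id', '_ts']))
features = time_series_columns(wide)

# For all ids, the same call could be repeated per group and concatenated.
all_features = pd.concat(
    [time_series_columns(g.sort_values('_ts').drop(columns=['_id', '_ts']))
     for _, g in df.groupby('_id')],
    ignore_index=True)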