Question
Suppose the following sparse table is given, listing the securities that are members of an index.
identifier from thru
AAPL 1964-03-31 --
ABT 1999-01-03 2003-12-31
ABT 2005-12-31 --
AEP 1992-01-15 2017-08-31
KO 2014-12-31 --
ABT, for example, was in the index from 1999-01-03 to 2003-12-31 and again from 2005-12-31 until today (`--` means today). In between, it was not listed in the index.
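For reference, here is a minimal sketch of how the sparse table above could be loaded into pandas so that `--` becomes a missing value. The variable names `raw` and `sparse` are only illustrative, and it assumes the table is available as whitespace-separated text.

```python
import io

import pandas as pd

# The sparse table from above, as whitespace-separated text (assumed layout).
raw = """identifier from thru
AAPL 1964-03-31 --
ABT 1999-01-03 2003-12-31
ABT 2005-12-31 --
AEP 1992-01-15 2017-08-31
KO 2014-12-31 --
"""

# '--' is read as a missing value, so open-ended rows get NaT in 'thru'.
sparse = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    na_values="--",
    parse_dates=["from", "thru"],
)
print(sparse)
```

Keeping open-ended rows as missing values (rather than a placeholder string) lets later steps decide what "today" should mean.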
How can I efficiently convert this sparse table into a dense table of the following form?
date AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
... ... ... ... ...
1999-01-03 1 1 1 0
1999-01-04 1 1 1 0
... ... ... ... ...
2003-12-31 1 1 1 0
2004-01-01 1 0 1 0
... ... ... ... ...
2017-09-04 1 1 0 1
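The "My solution" section below shows one way to solve the problem. Unfortunately, the code performs very poorly: processing 1648 entries takes about 22 seconds.
Since I am new to Python, I would like to know how to write code for problems like this efficiently.
I am not expecting anyone to hand me a solution to my problem (unless you wish to). My main goal is to understand how to solve such problems efficiently in Python. I used pandas functionality to match the corresponding entries. Should I use numpy and indexing instead? Should I use a different toolbox? How can I improve performance?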
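To make the "numpy and indexing" option concrete, here is a rough sketch of what it might look like: every date is compared against every `from`/`thru` pair in one broadcast operation, instead of building one DataFrame per row. The function name `dense_by_broadcast` is hypothetical, and treating a missing `thru` as "valid through the last date of the range" is an assumption.

```python
import numpy as np
import pandas as pd


def dense_by_broadcast(sparse, start, end, freq="B"):
    """Hypothetical helper: dense 0/1 membership table via numpy broadcasting."""
    dates = pd.date_range(start, end, freq=freq)
    frm = pd.to_datetime(sparse["from"]).to_numpy()
    # Assumption: a missing 'thru' means the row is valid through the last date.
    thru = pd.to_datetime(sparse["thru"]).fillna(dates[-1]).to_numpy()

    # Shape (n_dates, n_rows): True where a date lies inside a row's interval.
    d = dates.to_numpy()[:, None]
    mask = (d >= frm) & (d <= thru)

    # Combine rows that share an identifier (e.g. the two ABT intervals).
    codes, ids = pd.factorize(sparse["identifier"], sort=True)
    out = np.zeros((len(dates), len(ids)), dtype=bool)
    for j in range(len(ids)):  # loops over identifiers, not over dates
        out[:, j] = mask[:, codes == j].any(axis=1)
    return pd.DataFrame(out.astype(int), index=dates, columns=ids)


sparse = pd.DataFrame({
    "identifier": ["AAPL", "ABT", "ABT", "AEP", "KO"],
    "from": pd.to_datetime(["1964-03-31", "1999-01-03", "2005-12-31",
                            "1992-01-15", "2014-12-31"]),
    "thru": pd.to_datetime([None, "2003-12-31", None, "2017-08-31", None]),
})
dense = dense_by_broadcast(sparse, "1964-03-31", "2017-09-04")
print(dense.tail(3))
```

The intermediate mask has one boolean per (date, row) pair, so memory grows with the product of the two; for a few thousand rows and a few decades of business days that is still only tens of megabytes.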
My approach to the problem is in the section below, in case you are interested.
Thank you very much for your help.
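My solution
I tried to solve the problem by iterating over every row of the first table. In each iteration I build, for that row's from-thru interval, a boolean frame with all elements set to True and append it to a list. At the end I pd.concat the list, then unstack and reindex the resulting DataFrame.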
import pandas as pd
import numpy as np


def get_ts_data(data, start_date, end_date, attribute=None, identifier=None, frequency=None):
    """
    Transform sparse table to dense table.

    Parameters
    ----------
    data: pd.DataFrame
        sparse table with minimal column specification ['identifier', 'from', 'thru']
    start_date: pd.Timestamp, str
        start date of the dense matrix
    end_date: pd.Timestamp, str
        end date of the dense matrix
    attribute: str
        column name of the value of the dense matrix.
    identifier: str
        column name of the identifier
    frequency: str
        frequency of the dense matrix

    Returns
    -------
    pd.DataFrame
        dense table indexed by date, with one column per (attribute, identifier) pair
    """
    if attribute is None:
        attribute = ['on_index']
    elif not isinstance(attribute, list):
        attribute = [attribute]
    if identifier is None:
        identifier = ['identifier']
    elif not isinstance(identifier, list):
        identifier = [identifier]
    if frequency is None:
        frequency = 'B'
    # copy data so that the caller's frame is not modified
    data_mod = data.copy()
    data_mod['on_index'] = True
    # specify start date and check type
    if not isinstance(start_date, pd.Timestamp):
        start_date = pd.Timestamp(start_date)
    # specify end date and check type
    if not isinstance(end_date, pd.Timestamp):
        end_date = pd.Timestamp(end_date)
    # specify output date range
    date_range = pd.date_range(start_date, end_date, freq=frequency)
    # overwrite null 'thru' values: they indicate the entry is valid until today
    missing = data_mod['thru'].isnull()
    data_mod.loc[missing, 'thru'] = data_mod.loc[missing, 'from'].apply(lambda d: max(d, end_date))
    # collect one frame per sparse row
    frms = []
    # add dataframe to frms with time specific entries
    for index, row in data_mod.iterrows():
        # date range index
        d_range = pd.date_range(row['from'], row['thru'], freq=frequency)
        # Multi index with date and identifier
        d_index = pd.MultiIndex.from_product([d_range] + [[x] for x in row[identifier]], names=['date'] + identifier)
        # add DataFrame with repeated values to list
        frms.append(pd.DataFrame(data=np.repeat(row[attribute].values, d_index.size), index=d_index, columns=attribute))
    out_frame = pd.concat(frms)
    out_frame = out_frame.unstack(identifier)
    out_frame = out_frame.reindex(date_range)
    return out_frame
if __name__ == "__main__":
    data = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                         'from': [pd.Timestamp('1964-03-31'),
                                  pd.Timestamp('1999-01-03'),
                                  pd.Timestamp('2005-12-31'),
                                  pd.Timestamp('1992-01-15'),
                                  pd.Timestamp('2014-12-31')],
                         'thru': [np.nan,
                                  pd.Timestamp('2003-12-31'),
                                  np.nan,
                                  pd.Timestamp('2017-08-31'),
                                  np.nan]
                         })
    transformed_data = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04', attribute='on_index', identifier='identifier', frequency='B')
    print(transformed_data)
Answer 0 (score: 2)
import numpy as np
import pandas as pd

# `df` is the sparse table from the question (columns: identifier, from, thru).
# Ensure dates are Pandas timestamps.
df['from'] = pd.to_datetime(df['from'])
df['thru'] = pd.to_datetime(df['thru'].replace('--', np.nan))
# Get sorted list of all unique dates and create index for full range.
dates = sorted(set(df['from'].tolist() + df['thru'].dropna().tolist()))
dti = pd.date_range(start=dates[0], end=dates[-1], freq='B')
# Create new target dataframe based on symbols and full date range. Initialize to zero.
df2 = pd.DataFrame(0, columns=df['identifier'].unique(), index=dti)
# Find all active symbols and set their symbols' values to one from their respective `from` dates.
for _, row in df[df['thru'].isnull()].iterrows():
    df2.loc[df2.index >= row['from'], row['identifier']] = 1
# Find all other symbols and set their symbols' values to one between their respective `from` and `thru` dates.
for _, row in df[df['thru'].notnull()].iterrows():
    df2.loc[(df2.index >= row['from']) & (df2.index <= row['thru']), row['identifier']] = 1
>>> df2.head(3)
AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
1964-04-02 1 0 0 0
>>> df2.tail(3)
AAPL ABT AEP KO
2017-08-29 1 1 1 1
2017-08-30 1 1 1 1
2017-08-31 1 1 1 1
>>> df2.loc[:'2004-01-02', 'ABT'].tail()
2003-12-29 1
2003-12-30 1
2003-12-31 1
2004-01-01 0
2004-01-02 0
Freq: B, Name: ABT, dtype: int64
>>> df2.loc['2005-12-30':, 'ABT'].head(3)
2005-12-30 0
2006-01-02 1
2006-01-03 1
Freq: B, Name: ABT, dtype: int64
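If the per-row loops above ever become the bottleneck, one possible loop-free variant (not part of the original answer) is to record a +1 event where each interval starts and a -1 event just after it ends, then take a cumulative sum down the date axis. The sketch below is written under a few assumptions: the helper name `dense_by_cumsum` is made up, a missing `thru` is treated as running to the last date, and every `thru` is on or after its `from`.

```python
import numpy as np
import pandas as pd


def dense_by_cumsum(sparse, start, end, freq="B"):
    """Hypothetical helper: dense 0/1 table from +1/-1 events and a cumsum."""
    dates = pd.date_range(start, end, freq=freq)
    d = dates.to_numpy()
    frm = pd.to_datetime(sparse["from"]).to_numpy()
    # Assumption: open-ended rows (missing 'thru') run to the last date.
    thru = pd.to_datetime(sparse["thru"]).fillna(dates[-1]).to_numpy()
    codes, ids = pd.factorize(sparse["identifier"], sort=True)

    # First row position inside each interval, and first position after it.
    lo = np.searchsorted(d, frm, side="left")
    hi = np.searchsorted(d, thru, side="right")

    # One extra row so that hi == len(dates) is still a valid position.
    delta = np.zeros((len(d) + 1, len(ids)), dtype=np.int64)
    np.add.at(delta, (lo, codes), 1)   # interval opens
    np.add.at(delta, (hi, codes), -1)  # interval closes

    # The running sum is positive exactly while some interval is open.
    member = delta.cumsum(axis=0)[:-1] > 0
    return pd.DataFrame(member.astype(int), index=dates, columns=ids)


sparse = pd.DataFrame({
    "identifier": ["AAPL", "ABT", "ABT", "AEP", "KO"],
    "from": pd.to_datetime(["1964-03-31", "1999-01-03", "2005-12-31",
                            "1992-01-15", "2014-12-31"]),
    "thru": pd.to_datetime([None, "2003-12-31", None, "2017-08-31", None]),
})
dense = dense_by_cumsum(sparse, "1964-03-31", "2017-09-04")
print(dense.loc["2003-12-29":"2004-01-02", "ABT"])
```

Using `searchsorted` means `from` dates that fall on a weekend (such as 2005-12-31) simply map to the next business day in the index, which matches the output shown above.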