How can I reduce the number of duplicated rows in an xarray?

Time: 2019-03-14 12:40:55

Tags: python dimension python-xarray

I would like to remove the duplicated rows from this xarray:

<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan]],
       [[ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ]],
       ...,
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]],
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30

In the example above, the ticker is repeated 4 times. My goal is to get an output that looks like this:

<xarray.QFDataArray (dates: 61, tickers: 1, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
       [ 5.    ,     nan, ...,  2.1333, 70.02  ],
       ...,
       [    nan,     nan, ...,     nan,     nan],
       [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30

Note that the size of the "tickers" dimension has gone from 4 to 1.

Here is the code (library imports not included):

def _get_historical_data_cache():
    path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'cached_values_v2_clean.cache')
    data = cached_value(_get_historical_data_bloomberg, path) # data importation from cache memory, if not available, directly from a data provider
    return data

def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']

    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # it gives me duplicated tickers.

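From the data provider I get a 3D data array (xarray) with the following dimensions: dates, tickers and fields. The goal is to "slice" this cube ticker by ticker, so that each iteration yields a 2D data array (or a 3D xarray like the one shown above) holding a single ticker with its corresponding data (dates and fields).

This is what the xarray looks like in the first iteration (the same output as shown above). The problem is that the single ticker is duplicated: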

In[2]: slice
Out[2]: 
<xarray.QFDataArray (dates: 61, tickers: 4, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan],
        [ 4.9167,     nan, ...,  2.1695,     nan]],
       [[ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ],
        [ 5.    ,     nan, ...,  2.1333, 70.02  ]],
       ...,
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]],
       [[    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan],
        [    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity ... BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
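
The duplication can be reproduced on a toy array. The sketch below is a stand-in, not the original Bloomberg data: the array `da`, the labels `AAA` and `BBB`, and the values are all invented for illustration. It suggests that label-based selection on a non-unique index returns every matching position, which would explain why the tickers dimension keeps its duplicates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy cube: 3 dates x 4 tickers x 2 fields.
# The ticker labels are deliberately duplicated.
da = xr.DataArray(
    np.arange(24, dtype=float).reshape(3, 4, 2),
    dims=("dates", "tickers", "fields"),
    coords={
        "dates": pd.date_range("2000-01-01", periods=3),
        "tickers": ["AAA", "AAA", "BBB", "BBB"],
        "fields": ["PX_LAST", "VOLATILITY_360D"],
    },
)

# The index behind the tickers coordinate is not unique.
print(da.get_index("tickers").is_unique)

# Selecting by label keeps every position that carries the label,
# so the tickers dimension survives with one entry per duplicate.
selected = da.loc[:, "AAA", :]
print(selected.sizes)
```

If this matches what happens with the real data, the duplicates are already present in the cached cube, and the loop only exposes them.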

When I try the solution proposed by Ryan, the code looks like this:

def _slice_by_ticker():
    tickers = _get_historical_data_cache().indexes['tickers']

    for k in tickers:
        slice = _get_historical_data_cache().loc[:, k, :]  # it gives me duplicated tickers.

        # get unique ticker values as numpy array
        unique_tickers = np.unique(slice.tickers.values)
        da_reindexed = slice.reindex(tickers=unique_tickers)

This is the error:

ValueError: cannot reindex or align along dimension 'tickers' because the index has duplicate values

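Thanks for your help! :)

2 Answers:

Answer 0: (score: 0)

It sounds like you want to reindex your data array. (See the xarray docs on reindexing.)

Below, I assume that da is the name of the original data array: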

import numpy as np
# get unique ticker values as numpy array
unique_tickers = np.unique(da.tickers.values)
da_reindexed = da.reindex(tickers=unique_tickers)
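
As reported in the question, `reindex` refuses to work while the index itself contains duplicates. One possible workaround, sketched here on invented data (the array `da` and the label `AAA` are assumptions, not the asker's cube), is to keep only the first occurrence of each label with a boolean mask before doing anything else.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for one slice: the same ticker label four times.
da = xr.DataArray(
    np.ones((3, 4, 2)),
    dims=("dates", "tickers", "fields"),
    coords={
        "dates": pd.date_range("2000-01-01", periods=3),
        "tickers": ["AAA"] * 4,
        "fields": ["PX_LAST", "VOLATILITY_360D"],
    },
)

# Reindexing directly is expected to fail on the duplicated index.
try:
    da.reindex(tickers=np.unique(da.tickers.values))
except ValueError as err:
    print(f"reindex failed: {err}")

# Boolean mask that is True for the first occurrence of each label.
keep = ~da.get_index("tickers").duplicated()

# Positional selection with the mask drops the repeated entries.
deduped = da.isel(tickers=keep)
print(deduped.sizes)
```

This keeps the first of the repeated entries and assumes the repeats carry identical data, as they appear to in the question. Newer xarray releases also provide `DataArray.drop_duplicates(dim)`, which should have the same effect if your installed version includes it.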

Answer 1: (score: 0)

Found the answer.

First, I tried this:

slice_clean = (slice[:, :1]).rename('slice_clean')
slice.reindex_like(slice_clean)

This gave me the same error as shown above:

ValueError: cannot reindex or align along dimension 'tickers' because the index has duplicate values

Then I tried the following:

slice = slice[:,:1]

It worked!

<xarray.QFDataArray (dates: 61, tickers: 1, fields: 6)>
array([[[ 4.9167,     nan, ...,  2.1695,     nan]],

       [[ 5.    ,     nan, ...,  2.1333, 70.02  ]],

       ...,

       [[    nan,     nan, ...,     nan,     nan]],

       [[    nan,     nan, ...,     nan,     nan]]])
Coordinates:
  * tickers  (tickers) object BloombergTicker:0000630D US Equity
  * fields   (fields) <U27 'PX_LAST' 'BEST_PEG_RATIO' ... 'VOLATILITY_360D'
  * dates    (dates) datetime64[ns] 1995-06-30 1995-07-30 ... 2000-06-30
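
One caveat: `slice[:, :1]` is positional, so it keeps whatever sits at position 0 of the tickers dimension and only gives the right result when every entry in the slice is the same ticker. A more general sketch, again on invented data (the names `cube`, `unique_cube` and `panels` are hypothetical), removes repeated labels from the whole cube once and then selects each ticker by label, which yields the 2D (dates, fields) arrays the question asks for.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical cube with two distinct tickers, each stored twice.
cube = xr.DataArray(
    np.arange(24, dtype=float).reshape(3, 4, 2),
    dims=("dates", "tickers", "fields"),
    coords={
        "dates": pd.date_range("2000-01-01", periods=3),
        "tickers": ["AAA", "AAA", "BBB", "BBB"],
        "fields": ["PX_LAST", "VOLATILITY_360D"],
    },
)

# Positional slicing keeps only the entry at position 0 ("AAA" here).
first_only = cube[:, :1]

# Drop repeated labels once, keeping the first occurrence of each.
unique_cube = cube.isel(tickers=~cube.get_index("tickers").duplicated())

# With a unique index, selecting a single label drops the tickers
# dimension and leaves a 2D (dates, fields) array per ticker.
panels = {str(k): unique_cube.sel(tickers=k) for k in unique_cube.tickers.values}

for name, panel in panels.items():
    print(name, panel.dims, panel.shape)
```

Deduplicating before the loop also avoids reloading and re-slicing the cached cube on every iteration.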