ValueError when using apply with a pandas DataFrame

Date: 2016-12-07 02:19:13

Tags: python pandas vectorization apply

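I have a DataFrame (stock_pairs) that lists spread trades between stocks. It has two columns: one for the stock being bought and one for the stock being sold.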

    buy     sell
0   MSFT    MXIM
1   INTC    MXIM
2   AMZN    MXIM
3   NFLX    MXIM
4   BIIB    MXIM
5   GILD    MXIM
6   TEVA    MXIM
7   GDXJ    MXIM
8   SLAB    MXIM
9   NXPI    MXIM

Output of stock_pairs.to_dict():

{'buy': {0: 'MSFT',
  1: 'INTC',
  2: 'AMZN',
  3: 'NFLX',
  4: 'BIIB',
  5: 'GILD',
  6: 'TEVA',
  7: 'GDXJ',
  8: 'SLAB',
  9: 'NXPI'},
 'sell': {0: 'MXIM',
  1: 'MXIM',
  2: 'MXIM',
  3: 'MXIM',
  4: 'MXIM',
  5: 'MXIM',
  6: 'MXIM',
  7: 'MXIM',
  8: 'MXIM',
  9: 'MXIM'}}
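For anyone who wants to reproduce the setup, the frame above can be rebuilt from the listed tickers. This is only a convenience sketch; it assumes nothing beyond the ten pairs shown in the post.

```python
import pandas as pd

# Tickers copied from the table above; every pair sells MXIM.
buy = ["MSFT", "INTC", "AMZN", "NFLX", "BIIB",
       "GILD", "TEVA", "GDXJ", "SLAB", "NXPI"]

stock_pairs = pd.DataFrame(
    {"buy": buy, "sell": ["MXIM"] * len(buy)},
    columns=["buy", "sell"],  # fix the column order explicitly
)

print(stock_pairs)
```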

I have another DataFrame that contains price information for every stock in my universe.

stock_price_df looks like:

Stock dt  Price
0 MSFT 2015-12-31  -562.14
1 MSFT 2016-01-31  -701.18
2 MSFT 2016-02-29  -265.44
3 MSFT 2016-03-31  -42.62
4 MSFT 2016-04-30  -468.95
5 MSFT 2016-05-31  -549.94
6 MSFT 2016-06-30  80.84
7 MSFT 2016-07-31  -633.36
8 MSFT 2016-08-31  -1700.73
9 MSFT 2016-09-30  -229.40
10  MSFT 2016-10-31  996.27
11  MSFT 2016-11-30  117.01
12 MXIM 2015-12-31  56.44
13 MXIM 2016-01-31  -83.38
14 MXIM 2016-02-29  152.92
15 MXIM 2016-03-31  -48.93
16 MXIM 2016-04-30  387.37
17 MXIM 2016-05-31  -194.31
18 MXIM 2016-06-30  -332.07
19 MXIM 2016-07-31  303.43
20 MXIM 2016-08-31  55.33
21 MXIM 2016-09-30  -170.31
22 MXIM 2016-10-31  -411.65
23 MXIM 2016-11-30  -101.52

Output of stock_price_df.to_dict():

    {'Stock': {0: 'MSFT',
  1: 'MSFT',
  2: 'MSFT',
  3: 'MSFT',
  4: 'MSFT',
  5: 'MSFT',
  6: 'MSFT',
  7: 'MSFT',
  8: 'MSFT',
  9: 'MSFT',
  10: 'MSFT',
  11: 'MSFT',
  10440: 'MXIM ',
  10441: 'MXIM ',
  10442: 'MXIM ',
  10443: 'MXIM ',
  10444: 'MXIM ',
  10445: 'MXIM ',
  10446: 'MXIM ',
  10447: 'MXIM ',
  10448: 'MXIM ',
  10449: 'MXIM ',
  10450: 'MXIM ',
  10451: 'MXIM '},
 'dt': {0: Timestamp('2015-12-31 00:00:00'),
  1: Timestamp('2016-01-31 00:00:00'),
  2: Timestamp('2016-02-29 00:00:00'),
  3: Timestamp('2016-03-31 00:00:00'),
  4: Timestamp('2016-04-30 00:00:00'),
  5: Timestamp('2016-05-31 00:00:00'),
  6: Timestamp('2016-06-30 00:00:00'),
  7: Timestamp('2016-07-31 00:00:00'),
  8: Timestamp('2016-08-31 00:00:00'),
  9: Timestamp('2016-09-30 00:00:00'),
  10: Timestamp('2016-10-31 00:00:00'),
  11: Timestamp('2016-11-30 00:00:00'),
  12: Timestamp('2015-12-31 00:00:00'),
  13: Timestamp('2016-01-31 00:00:00'),
  14: Timestamp('2016-02-29 00:00:00'),
  15: Timestamp('2016-03-31 00:00:00'),
  16: Timestamp('2016-04-30 00:00:00'),
  17: Timestamp('2016-05-31 00:00:00'),
  18: Timestamp('2016-06-30 00:00:00'),
  19: Timestamp('2016-07-31 00:00:00'),
  20: Timestamp('2016-08-31 00:00:00'),
  21: Timestamp('2016-09-30 00:00:00'),
  22: Timestamp('2016-10-31 00:00:00'),
  23: Timestamp('2016-11-30 00:00:00')},
 'Price': {0: -562.13999999999999,
  1: -701.18000000000029,
  2: -265.43999999999994,
  3: -42.620000000000012,
  4: -468.9500000000001,
  5: -549.94000000000005,
  6: 80.840000000000032,
  7: -633.36000000000013,
  8: -1700.7300000000002,
  9: -229.40000000000006,
  10: 996.26999999999998,
  11: 117.01000000000001,
  12: 56.439999999999998,
  13: -83.380000000000024,
  14: 152.91999999999996,
  15: -48.929999999999993,
  16: 387.37,
  17: -194.30999999999997,
  18: -332.07000000000011,
  19: 303.43000000000001,
  20: 55.330000000000013,
  21: -170.31,
  22: -411.64999999999998,
  23: -101.52}}
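One detail in the dump above is worth checking: the MXIM entries appear as 'MXIM ' with a trailing space, and their index labels (10440 to 10451) differ from the printed table. This may be unrelated to the ValueError, but padded tickers would make exact string matches against stock_pairs fail. The sketch below uses a small stand-in for stock_price_df (four rows copied from the post) to show the effect and a possible cleanup with `str.strip`.

```python
import pandas as pd

# Small stand-in for stock_price_df; the trailing space in "MXIM " mirrors the dump above.
stock_price_df = pd.DataFrame({
    "Stock": ["MSFT", "MSFT", "MXIM ", "MXIM "],
    "dt": pd.to_datetime(["2015-12-31", "2016-01-31", "2015-12-31", "2016-01-31"]),
    "Price": [-562.14, -701.18, 56.44, -83.38],
})

# An exact match on "MXIM" finds nothing while the padding is present.
rows_before = int((stock_price_df["Stock"] == "MXIM").sum())

# Strip surrounding whitespace; the same comparison then matches both rows.
stock_price_df["Stock"] = stock_price_df["Stock"].str.strip()
rows_after = int((stock_price_df["Stock"] == "MXIM").sum())

print(rows_before, rows_after)
```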

I have a function called cal_stats_align_data, which is run like this:

A) stock_pair_datadump = stock_pairs.apply(cal_stats_align_data, axis=1, args=(stock_price_df,))

It can also be run like this:

B) stock_pair_datadump = cal_stats_align_data(stock_pairs.iloc[0], stock_price_df)

Call A) runs the operation for every stock pair in the stock_pairs DataFrame, whereas call B) runs it for a single pair only.
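A note on the `args` parameter used in call A): pandas forwards its contents to the function as extra positional arguments, so it is expected to be a tuple, and a one-element tuple needs a trailing comma. The plain-Python sketch below illustrates the difference; `price_table`, `call_with_args`, and `lookup` are hypothetical stand-ins, not part of the original code.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
price_table = {"MSFT": -562.14}   # hypothetical stand-in for stock_price_df

not_a_tuple = (price_table)       # still the same dict object
one_tuple = (price_table,)        # a tuple holding one element


def call_with_args(func, row, args=()):
    # Mimics how extra positional arguments are forwarded: func(row, *args)
    return func(row, *args)


def lookup(row, prices):
    # Return the price stored for the given ticker.
    return prices[row]


result = call_with_args(lookup, "MSFT", args=one_tuple)
print(type(not_a_tuple).__name__, type(one_tuple).__name__, result)
```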

For each pair, the function cal_stats_align_data returns 1 row x 20 columns of statistics.

So the output should have the same number of rows as stock_pairs, but with 20 columns of data.
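The expected output shape can be sketched with a stand-in function. The body of cal_stats_align_data is not shown in the post, so `cal_stats_stub` and the names in `STAT_NAMES` are hypothetical. The sketch assumes the function returns a `pd.Series` with one labelled entry per statistic, in which case `apply(..., axis=1)` expands each returned Series into a row of the result.

```python
import pandas as pd

stock_pairs = pd.DataFrame({"buy": ["MSFT", "INTC"], "sell": ["MXIM", "MXIM"]})

# Hypothetical statistic names; the real 20 outputs are not shown in the post.
STAT_NAMES = [f"stat_{i:02d}" for i in range(20)]


def cal_stats_stub(pair):
    # Placeholder values only; a real implementation would use price data.
    values = [float(len(pair["buy"]) + i) for i in range(20)]
    return pd.Series(values, index=STAT_NAMES)


# Each returned Series becomes one row; its index becomes the column labels.
stock_pair_datadump = stock_pairs.apply(cal_stats_stub, axis=1)
print(stock_pair_datadump.shape)
```

Returning a labelled Series rather than a bare list or array also gives the 20 columns meaningful names.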

Call B) works fine. However, when I try call A) (that is, over the whole stock_pairs universe), I get the following error:

ValueError: cannot copy sequence with size 20 to array axis with dimension 1

More details:

--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
   1403                 result = np.empty(len(values), dtype=object)
-> 1404                 result[:] = values
   1405             except ValueError:

ValueError: could not broadcast input array from shape (20) into shape (1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-1427-1e0b85417edf> in <module>()
----> 1 miso_path_datadump = path_master_filtered[['src','snk']][0:2].apply(cal_stats_align_df, axis=1, args=(mcc_filtered_final, cost_filtered_final,                         dt.datetime(2017,1,1), 'Peak',0))

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4059                     if reduce is None:
   4060                         reduce = True
-> 4061                     return self._apply_standard(f, axis, reduce=reduce)
   4062             else:
   4063                 return self._apply_broadcast(f, axis)

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4172                 index = None
   4173 
-> 4174             result = self._constructor(data=results, index=index)
   4175             result.columns = res_index
   4176 

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    222                                  dtype=dtype, copy=copy)
    223         elif isinstance(data, dict):
--> 224             mgr = self._init_dict(data, index, columns, dtype=dtype)
    225         elif isinstance(data, ma.MaskedArray):
    226             import numpy.ma.mrecords as mrecords

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
    358             arrays = [data[k] for k in keys]
    359 
--> 360         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    361 
    362     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5234 
   5235     # don't force copy because getting jammed in an ndarray anyway
-> 5236     arrays = _homogenize(arrays, index, dtype)
   5237 
   5238     # from BlockManager perspective

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\frame.py in _homogenize(data, index, dtype)
   5544                 v = lib.fast_multiget(v, oindex.values, default=NA)
   5545             v = _sanitize_array(v, index, dtype=dtype, copy=False,
-> 5546                                 raise_cast_failure=False)
   5547 
   5548         homogenized.append(v)

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   2920             raise Exception('Data must be 1-dimensional')
   2921         else:
-> 2922             subarr = _asarray_tuplesafe(data, dtype=dtype)
   2923 
   2924     # This is to prevent mixed-type Series getting all casted to

C:\Users\blahblah\Anaconda3\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
   1405             except ValueError:
   1406                 # we have a list-of-list
-> 1407                 result[:] = [tuple(x) for x in values]
   1408 
   1409     return result

ValueError: cannot copy sequence with size 20 to array axis with dimension 1
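The traceback ends inside the DataFrame constructor that apply uses to assemble the per-row results, which suggests the 20-element results could not be combined into a frame. One way to sidestep that assembly step, sketched here under assumptions, is to call the function once per pair (as in call B) and build the DataFrame explicitly. `cal_stats_stub` is a hypothetical stand-in that returns a plain list of 20 numbers.

```python
import pandas as pd

stock_pairs = pd.DataFrame({"buy": ["MSFT", "INTC"], "sell": ["MXIM", "MXIM"]})


def cal_stats_stub(pair, n_stats=20):
    # Hypothetical stand-in returning a plain list of 20 numbers.
    return [float(i) for i in range(n_stats)]


# Evaluate one pair at a time and keep each result as one row.
rows = [cal_stats_stub(pair) for _, pair in stock_pairs.iterrows()]

# Build the frame directly, reusing the index of stock_pairs.
stock_pair_datadump = pd.DataFrame(rows, index=stock_pairs.index)
print(stock_pair_datadump.shape)
```

In pandas 0.23 and later, `DataFrame.apply` also accepts `result_type='expand'`, which turns list-like results into columns; that option is newer than the version shown in the traceback.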

Any ideas on how to fix this?

Thanks.

0 answers:

No answers yet.