ValueError: cannot broadcast when applying a function to a Pandas DataFrame

Time: 2016-09-21 17:41:10

Tags: python pandas

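I'm having trouble applying a function to each row of a DataFrame. Two of the columns hold int values, and I'm trying to get a Series that holds, for each row, the range of years from FirstYear to LastYear. The data is here, and looks like this: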

 Unnamed: 0  FirstYear  LastYear        Here Change  \
0           0       1990      2007  50930104.0    No    
1           1       2001      2001  50850401.0    No    
2           2       2001      2008  73590600.0    No    
3           3       1999      2002  79299903.0    Yes   
4           4       2002      2007  79299903.0    Yes   

                                           Industry  \
0                                     Textile waste   
1  Fasteners, industrial: nuts, bolts, screws, etc.   
2                    Party supplies rental services   
3                               Disc jockey service   
4                               Disc jockey service   

                         IndustryGroup  \
0          Miscellaneous Durable Goods   
1   Machinery, Equipment, and Supplies   
2     Misc. Equipment Rental & Leasing   
3  Producers, Orchestras, Entertainers   
4  Producers, Orchestras, Entertainers   

                                             Company         SIC     Sales  \
0  CARLSONS MILLS INC                            ...  50930104.0  450000.0   
1  LAWSON PRODUCTS                               ...  50850401.0  450000.0   
2  HAWAIIAN GUY PTY RENTALS SUPS                 ...  73590600.0  150000.0   
3  TROPICAL STORM PRODUCTION                     ...  59630000.0   16800.0   
4  TROPICAL STORM PRODUCTION                     ...  79299903.0  100000.0   

    Emp Class  YearsActive  
0  14.0   not           18  
1   3.0   not            1  
2   3.0   not            8  
3   1.0   des            4  
4   7.0   not            6  

If I read the whole DataFrame into memory, this is easy to do:

df = pd.read_csv('test.csv')

chunkrange = df.apply(lambda x: range(x.FirstYear, x.LastYear+1), reduce=True,  axis=1)

chunkrange.head()
Out[29]: 
0    [1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...
1                                               [2001]
2     [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
3                             [1999, 2000, 2001, 2002]
4                 [2002, 2003, 2004, 2005, 2006, 2007]
dtype: object
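
A note for readers on a newer pandas: the `reduce` keyword used above belongs to the older release shown in this post and, as far as I know, was later deprecated and removed from `DataFrame.apply`. Below is a minimal sketch of the same row-wise call without it. It assumes Python 3 and uses a small hypothetical frame (`sample`) that holds only the two year columns from the first five rows:

```python
import pandas as pd

# Hypothetical stand-in for test.csv: only the two year columns, first five rows.
sample = pd.DataFrame({
    "FirstYear": [1990, 2001, 2001, 1999, 2002],
    "LastYear": [2007, 2001, 2008, 2002, 2007],
})

# list(...) materialises the years, because Python 3's range is lazy.
year_ranges = sample.apply(
    lambda row: list(range(row.FirstYear, row.LastYear + 1)), axis=1
)
print(year_ranges)
```

Each element of `year_ranges` is a plain Python list, so the result has dtype `object`, as in the output above.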

However, if I read it in chunks, this happens:

df = pd.read_csv('test.csv', chunksize=3)
for chunk in df:
    chunkrange = chunk.apply(lambda x: range(x.FirstYear, x.LastYear+1), reduce=True,  axis=1)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-81a985cd4630> in <module>()
      1 for chunk in df:
----> 2     chunkrange = chunk.apply(lambda x: range(x.FirstYear, x.LastYear+1), reduce=True,  axis=1)
      3 

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4059                     if reduce is None:
   4060                         reduce = True
-> 4061                     return self._apply_standard(f, axis, reduce=reduce)
   4062             else:
   4063                 return self._apply_broadcast(f, axis)

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   4172                 index = None
   4173 
-> 4174             result = self._constructor(data=results, index=index)
   4175             result.columns = res_index
   4176 

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
    222                                  dtype=dtype, copy=copy)
    223         elif isinstance(data, dict):
--> 224             mgr = self._init_dict(data, index, columns, dtype=dtype)
    225         elif isinstance(data, ma.MaskedArray):
    226             import numpy.ma.mrecords as mrecords

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
    358             arrays = [data[k] for k in keys]
    359 
--> 360         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    361 
    362     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5239     axes = [_ensure_index(columns), _ensure_index(index)]
   5240 
-> 5241     return create_block_manager_from_arrays(arrays, arr_names, axes)
   5242 
   5243 

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.pyc in create_block_manager_from_arrays(arrays, names, axes)
   4002         return mgr
   4003     except ValueError as e:
-> 4004         construction_error(len(arrays), arrays[0].shape, axes, e)
   4005 
   4006 

C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.pyc in construction_error(tot_items, block_shape, axes, e)
   3964     implied = tuple(map(int, [len(ax) for ax in axes]))
   3965     if passed == implied and e is not None:
-> 3966         raise e
   3967     if block_shape[0] == 0:
   3968         raise ValueError("Empty data passed with indices specified.")

ValueError: could not broadcast input array from shape (5) into shape (13)
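
One way to sidestep `apply` altogether is to build the Series directly from the two columns. That way pandas never has to decide whether the list-valued results should be assembled into a DataFrame, which appears to be the step where the traceback above fails. This is only a sketch, not a confirmed fix for this pandas version. It assumes Python 3, and the inline `csv_text` and the helper `year_ranges` are hypothetical stand-ins for `test.csv` and the lambda:

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv, reduced to the two columns that matter.
csv_text = """FirstYear,LastYear
1990,2007
2001,2001
2001,2008
1999,2002
2002,2007
"""


def year_ranges(frame):
    """Return a Series with one list of years per row of `frame`."""
    years = [
        list(range(first, last + 1))
        for first, last in zip(frame["FirstYear"], frame["LastYear"])
    ]
    # Reuse the chunk's index so the pieces line up after concatenation.
    return pd.Series(years, index=frame.index)


# Process the file three rows at a time, then stitch the pieces together.
pieces = [
    year_ranges(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3)
]
chunkrange = pd.concat(pieces)
print(chunkrange)
```

If the full result is too large to hold in memory, each piece can be processed inside the loop instead of being collected into `pieces`.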

0 answers:

No answers yet.