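I'm having some trouble applying a function to each row of a DataFrame. Two of the columns hold int values, and for each row I'm trying to get the range spanned by those two values, collected into a Series. The data is here, and looks like this: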
Unnamed: 0 FirstYear LastYear Here Change \
0 0 1990 2007 50930104.0 No
1 1 2001 2001 50850401.0 No
2 2 2001 2008 73590600.0 No
3 3 1999 2002 79299903.0 Yes
4 4 2002 2007 79299903.0 Yes
Industry \
0 Textile waste
1 Fasteners, industrial: nuts, bolts, screws, etc.
2 Party supplies rental services
3 Disc jockey service
4 Disc jockey service
IndustryGroup \
0 Miscellaneous Durable Goods
1 Machinery, Equipment, and Supplies
2 Misc. Equipment Rental & Leasing
3 Producers, Orchestras, Entertainers
4 Producers, Orchestras, Entertainers
Company SIC Sales \
0 CARLSONS MILLS INC ... 50930104.0 450000.0
1 LAWSON PRODUCTS ... 50850401.0 450000.0
2 HAWAIIAN GUY PTY RENTALS SUPS ... 73590600.0 150000.0
3 TROPICAL STORM PRODUCTION ... 59630000.0 16800.0
4 TROPICAL STORM PRODUCTION ... 79299903.0 100000.0
Emp Class YearsActive
0 14.0 not 18
1 3.0 not 1
2 3.0 not 8
3 1.0 des 4
4 7.0 not 6
If I read the whole DataFrame into memory, this is easy to do:
import pandas as pd
df = pd.read_csv('test.csv')
chunkrange = df.apply(lambda x: range(x.FirstYear, x.LastYear+1), reduce=True, axis=1)
chunkrange.head()
Out[29]:
0 [1990, 1991, 1992, 1993, 1994, 1995, 1996, 199...
1 [2001]
2 [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
3 [1999, 2000, 2001, 2002]
4 [2002, 2003, 2004, 2005, 2006, 2007]
dtype: object
However, if I read the file in chunks, this happens:
df = pd.read_csv('test.csv', chunksize=3)
for chunk in df:
    chunkrange = chunk.apply(lambda x: range(x.FirstYear, x.LastYear+1), reduce=True, axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-25-81a985cd4630> in <module>()
1 for chunk in df:
----> 2 chunkrange = chunk.apply(lambda x: range(x.FirstYear, x.LastYear+1), reduce=True, axis=1)
3
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4059 if reduce is None:
4060 reduce = True
-> 4061 return self._apply_standard(f, axis, reduce=reduce)
4062 else:
4063 return self._apply_broadcast(f, axis)
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
4172 index = None
4173
-> 4174 result = self._constructor(data=results, index=index)
4175 result.columns = res_index
4176
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
222 dtype=dtype, copy=copy)
223 elif isinstance(data, dict):
--> 224 mgr = self._init_dict(data, index, columns, dtype=dtype)
225 elif isinstance(data, ma.MaskedArray):
226 import numpy.ma.mrecords as mrecords
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
358 arrays = [data[k] for k in keys]
359
--> 360 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
361
362 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5239 axes = [_ensure_index(columns), _ensure_index(index)]
5240
-> 5241 return create_block_manager_from_arrays(arrays, arr_names, axes)
5242
5243
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.pyc in create_block_manager_from_arrays(arrays, names, axes)
4002 return mgr
4003 except ValueError as e:
-> 4004 construction_error(len(arrays), arrays[0].shape, axes, e)
4005
4006
C:\Users\jc4673\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\internals.pyc in construction_error(tot_items, block_shape, axes, e)
3964 implied = tuple(map(int, [len(ax) for ax in axes]))
3965 if passed == implied and e is not None:
-> 3966 raise e
3967 if block_shape[0] == 0:
3968 raise ValueError("Empty data passed with indices specified.")
ValueError: could not broadcast input array from shape (5) into shape (13)
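For comparison, here is a minimal sketch of what I'm ultimately after, written so that it does not go through `DataFrame.apply` at all. It is only an illustration under some assumptions: the inline `csv_text` is a hypothetical stand-in for `test.csv` containing just the two year columns from the sample above, and `year_ranges` is a helper name I made up. I have not confirmed that this addresses the underlying cause of the `ValueError`; it just builds the per-row lists directly.

```python
import io

import pandas as pd

# Hypothetical stand-in for test.csv: only the two year columns from the sample rows.
csv_text = """FirstYear,LastYear
1990,2007
2001,2001
2001,2008
1999,2002
2002,2007
"""


def year_ranges(chunk):
    """Return a Series holding, for each row, the list of years FirstYear..LastYear."""
    ranges = [
        list(range(first, last + 1))
        for first, last in zip(chunk["FirstYear"], chunk["LastYear"])
    ]
    # Reuse the chunk's index so the pieces line up with the original rows.
    return pd.Series(ranges, index=chunk.index)


# Read the data three rows at a time, as in the failing example.
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    pieces.append(year_ranges(chunk))

# Stitch the per-chunk results back into one Series.
chunkrange = pd.concat(pieces)
print(chunkrange)
```

Each element of `chunkrange` is a plain Python list, so the result has `dtype: object`, the same as the in-memory version.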