MemoryError when trying to concat huge DataFrames from read_csv

Asked: 2019-03-10 02:58:16

Tags: python pandas out-of-memory concat

I have a large dataset of about 109 GB, split across 245 .csv.gz files. I am trying to open each of them and then join them all together. The first step works fine and I get the per-file result I want, but when I then try to concatenate them I get the following error.

Traceback (most recent call last):
  File "CleanDataVR2.py", line 92, in <module>
    df_concated = pd.concat(dl)
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/tools/merge.py", line 835, in concat
    return op.get_result()
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/tools/merge.py", line 1025, in get_result
    concat_axis=self.axis, copy=self.copy)
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/core/internals.py", line 4474, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/core/internals.py", line 4579, in concatenate_join_units
    concat_values = com._concat_compat(to_concat, axis=concat_axis)
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/core/common.py", line 2741, in _concat_compat
    return np.concatenate(to_concat, axis=axis)
MemoryError

Here is my code.

import pandas as pd

dl = []
dtype_dict = {'#RIC': 'str', 'Date[G]': 'str', 'Time[G]': 'str',
              'Price': 'float16', 'Volume': 'float16'}

for f in file_list:
    df = pd.read_csv(f, compression='gzip', header=0, sep=',',
                     quotechar='"',
                     usecols=['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume'],
                     dtype=dtype_dict, low_memory=False)
    df = df.dropna(subset=['Price'])
    df['Tage'] = df['Volume'].map(volume_filter)
    dl.append(df)
df_concated = pd.concat(dl)

The function volume_filter is just a helper I wrote to produce the column named Tage. Since that part works fine, I have not included it here. Each DataFrame has the same structure, shown below.

            #RIC      Date[G]       Time[G]    Price  Volume   Tage
0         URKAq.L  01-SEP-2008  11:19:46.800   45.000   152.0   T200
8         URKAq.L  01-SEP-2008  11:28:53.769   45.000  2848.0  T3000
9         URKAq.L  01-SEP-2008  11:28:53.769   45.000  1725.0  T2000
11        URKAq.L  01-SEP-2008  11:28:53.844   45.000   427.0   T500
13        URKAq.L  01-SEP-2008  11:28:53.898   45.000   450.0   T500
15        URKAq.L  01-SEP-2008  11:28:53.981   45.000   200.0   T200
20        URKAq.L  01-SEP-2008  11:28:54.124   45.000   850.0   T900
21        URKAq.L  01-SEP-2008  11:28:54.124   45.000  1073.0  T2000
24        URKAq.L  01-SEP-2008  11:28:54.329   45.000   200.0   T200
25        URKAq.L  01-SEP-2008  11:28:54.617   44.965   310.0   T400
26        URKAq.L  01-SEP-2008  11:28:54.617   44.965   310.0   T400
29        URKAq.L  01-SEP-2008  11:29:04.522   45.025   620.0   T700
30        URKAq.L  01-SEP-2008  11:29:04.769   45.025   620.0   T700
31        URKAq.L  01-SEP-2008  11:30:21.974   45.000  2800.0  T3000
32        URKAq.L  01-SEP-2008  11:30:21.974   45.000   700.0   T700
35        URKAq.L  01-SEP-2008  11:30:22.036   45.000   679.0   T700
39        URKAq.L  01-SEP-2008  11:30:22.110   45.000   200.0   T200
40        URKAq.L  01-SEP-2008  11:30:22.114   45.000   250.0   T300
42        URKAq.L  01-SEP-2008  11:30:22.405   45.025   243.0   T300
43        URKAq.L  01-SEP-2008  11:30:22.663   45.025   243.0   T300
44        URKAq.L  01-SEP-2008  11:30:23.737   45.000  2550.0  T3000
47        URKAq.L  01-SEP-2008  11:30:23.769   45.000  1500.0  T2000
51        URKAq.L  01-SEP-2008  11:30:23.769   44.920   200.0   T200
52        URKAq.L  01-SEP-2008  11:30:23.856   44.900   150.0   T200
54        URKAq.L  01-SEP-2008  11:30:25.101   45.000  1901.0  T2000
56        URKAq.L  01-SEP-2008  11:30:25.145   44.900   650.0   T700
58        URKAq.L  01-SEP-2008  11:30:25.145   44.900   200.0   T200
64        URKAq.L  01-SEP-2008  11:30:37.648   44.950   195.0   T200
65        URKAq.L  01-SEP-2008  11:30:37.829   44.950   195.0   T200
68        URKAq.L  01-SEP-2008  11:30:47.743   44.950  1031.0  T2000
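(Editor's note: the asker omitted volume_filter, but the sample rows above are consistent with rounding Volume up to one significant digit and prefixing "T" (152 → T200, 2848 → T3000, 1073 → T2000). A hypothetical reconstruction, not the asker's actual code:)

```python
import math

def volume_filter(volume):
    # Hypothetical sketch: bucket a trade volume by rounding it up to one
    # significant digit and prefixing "T", e.g. 2848 -> "T3000".
    if volume <= 0:
        return 'T0'
    magnitude = 10 ** int(math.floor(math.log10(volume)))  # e.g. 1000 for 2848
    return 'T%d' % (math.ceil(volume / magnitude) * magnitude)

print(volume_filter(152))   # T200
print(volume_filter(2848))  # T3000
```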

I also tried a solution from @B. M.:

import os
import numpy as np
import pandas as pd

def byhand(dfs):
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        for df in dfs:
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ = df.values.dtype
    # del dfs
    with open('df_all.bin', 'rb') as f:
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all

But I still get the following error.

Traceback (most recent call last):
  File "CleanDataVR2.py", line 107, in <module>
    byhand(dl)
  File "CleanDataVR2.py", line 102, in byhand
    data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
ValueError: cannot create an OBJECT array from memory buffer
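(Editor's note: this ValueError is expected here. With string columns like #RIC mixed in with floats, df.values collapses to an object-dtype array; tobytes() on an object array would only serialize Python pointers, so NumPy refuses to rebuild it via frombuffer. A minimal demonstration:)

```python
import numpy as np
import pandas as pd

# Mixed string/float columns collapse to a single object-dtype array.
df = pd.DataFrame({'#RIC': ['URKAq.L'], 'Price': [45.0]})
print(df.values.dtype)      # object

# Purely numeric columns keep a real numeric dtype, which tobytes() can handle.
df_num = df[['Price']]
print(df_num.values.dtype)  # float64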

Now I am really confused. I ran the same code on a supercomputer with only 2 files and everything worked fine. Why do I get an error with more files? Also, I requested 5 * 256 GB of memory, which should be enough.
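(Editor's note: pd.concat has to materialize the whole result in a fresh allocation on top of the frames already held in dl, so peak usage is well above the sum of the inputs, and a single Python process may also be capped below the cluster's total memory. One hedged workaround sketch, assuming the per-file cleaning above, is to append each cleaned frame to a single on-disk CSV instead of keeping them all in memory; out_path is a made-up name:)

```python
import pandas as pd

def append_cleaned(file_list, out_path='all_trades.csv'):
    # Sketch: stream each cleaned file into one on-disk CSV so that only one
    # DataFrame is ever held in memory at a time.
    first = True
    for f in file_list:
        df = pd.read_csv(f, compression='gzip', header=0,
                         usecols=['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume'])
        df = df.dropna(subset=['Price'])
        # Write the header only for the first file, then append without it.
        df.to_csv(out_path, mode='w' if first else 'a',
                  header=first, index=False)
        first = False
```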

0 Answers:

There are no answers yet.