How can I process data without killing the kernel?

Asked: 2018-08-11 12:36:31

Tags: python optimization memory-management jupyter-notebook

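I want to process the data in the unsupervised.py notebook. However, every time I start it, my computer nearly freezes and the kernel appears to die. It seems to be caused by a memory management problem, specifically when step 3 of the following function runs: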

>>>import numpy as np
>>>from textblob import TextBlob
>>>train.shape
(130318, 4)
>>>len(dict_emb)
179862
>>>def process_data(train):

    print("step 1")
    train['sentences'] = train['context'].apply(lambda x: [item.raw for item in TextBlob(x).sentences])

    print("step 2")
    train["target"] = train.apply(get_target, axis = 1)

    print("step 3")
    train['sent_emb'] = train['sentences'].apply(
        lambda x: [dict_emb[item][0] 
        if item in dict_emb 
        else np.zeros(4096) for item in x])

    # Return the frame; otherwise `train = process_data(train)` sets train to None.
    return train

>>>train = process_data(train)

Could it be a memory problem? Is there an online solution? For now, I will try Google Colaboratory ...
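A rough back-of-the-envelope estimate can help judge whether memory is a plausible culprit. The sketch below uses the row count and embedding size from the question. The average number of sentences per row and the float64 element size are assumptions, not measured values. It is also an upper bound: vectors taken from dict_emb may be referenced rather than copied, whereas each np.zeros(4096) call does allocate a new array.

```python
# Figures taken from the question.
n_rows = 130318   # train.shape[0]
emb_dim = 4096    # length of each sentence embedding

# Assumptions (hypothetical, not measured on the real data).
sentences_per_row = 5   # assumed average number of sentences per context
bytes_per_value = 8     # float64, the default dtype of np.zeros

# Upper bound: every sentence gets its own freshly allocated vector.
total_bytes = n_rows * sentences_per_row * emb_dim * bytes_per_value
total_gib = total_bytes / 2**30

print("{:.1f} GiB".format(total_gib))  # prints "19.9 GiB"
```

Under these assumptions the column would need far more RAM than a typical laptop has, which would be consistent with the machine freezing during step 3.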

Maybe it could be turned into a loop that processes the rows in batches? My attempt:

for i in range(0,len(train.shape[0]-200,200)):
    print(i)
    train['sent_emb'] = train['sentences'].iloc[i,i+200].apply(
        lambda x: [dict_emb[item][0] 
        if item in dict_emb 
        else np.zeros(4096) for item in x])      

But this gives me several errors:

step 1
step 2
step 3

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-d3e879a8c753> in <module>()
----> 1 train = process_data(train)

<ipython-input-25-7063894d5c9a> in process_data(train)
     10     #train['sent_emb'] = train['sentences'].apply(lambda x: [dict_emb[item][0] if item in\
     11     #                                                       dict_emb else np.zeros(4096) for item in x])
---> 12     train['quest_emb'] =[]
     13     for i in range(0,len(train.shape[0]-200,200)):
     14         print(i)

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3117         else:
   3118             # set column
-> 3119             self._set_item(key, value)
   3120 
   3121     def _setitem_slice(self, key, value):

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in _set_item(self, key, value)
   3192 
   3193         self._ensure_valid_index(value)
-> 3194         value = self._sanitize_column(key, value)
   3195         NDFrame._set_item(self, key, value)
   3196 

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
   3389 
   3390             # turn me into an ndarray
-> 3391             value = _sanitize_index(value, self.index, copy=False)
   3392             if not isinstance(value, (np.ndarray, Index)):
   3393                 if isinstance(value, list) and len(value) > 0:

~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
   3999 
   4000     if len(data) != len(index):
-> 4001         raise ValueError('Length of values does not match length of ' 'index')
   4002 
   4003     if isinstance(data, ABCIndexClass) and not copy:

ValueError: Length of values does not match length of index
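The ValueError in the traceback comes from `train['quest_emb'] =[]`: an empty list has length 0, while the frame's index has 130318 rows, so pandas refuses the assignment. The loop has further problems: `len()` accepts a single argument, so the three values were presumably meant for `range()`, and `.iloc[i,i+200]` selects a (row, column) pair rather than the row slice `.iloc[i:i+200]`. A minimal sketch of a batched version follows. The tiny `dict_emb` and `train` are hypothetical stand-ins for the real data, and `embed_sentences`, `embed_in_batches`, `EMB_DIM` and `ZERO_VEC` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

EMB_DIM = 4096  # embedding size used in the question

# Hypothetical stand-ins for the question's `dict_emb` and `train`.
dict_emb = {"a b.": np.ones((1, EMB_DIM), dtype=np.float32)}
train = pd.DataFrame({"sentences": [["a b.", "unknown."], ["a b."], []]})

# One shared zero vector instead of a fresh np.zeros(4096) per missing sentence.
# Assumption: downstream code only reads this vector and never modifies it.
ZERO_VEC = np.zeros(EMB_DIM, dtype=np.float32)


def embed_sentences(sentences):
    """Look up each sentence, falling back to the shared zero vector."""
    return [dict_emb[s][0] if s in dict_emb else ZERO_VEC for s in sentences]


def embed_in_batches(frame, batch_size=200):
    """Embed `frame['sentences']` batch by batch and return one Series."""
    parts = []
    for start in range(0, len(frame), batch_size):
        # Slice rows with `start:start + batch_size`, not `start, start + batch_size`.
        batch = frame["sentences"].iloc[start:start + batch_size]
        parts.append(batch.apply(embed_sentences))
    # Assumes the frame has at least one row; pd.concat rejects an empty list.
    return pd.concat(parts)


# Assign the complete Series once, so its length matches the frame's index.
train["sent_emb"] = embed_in_batches(train, batch_size=2)
```

Batching by itself does not lower the final memory footprint, because every batch is still kept in the result. Any saving in this sketch would come from sharing a single zero vector and from using float32. If the full column still does not fit in RAM, writing each batch to disk instead of collecting it in `parts` may be an option.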

0 answers:

No answers