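I want to process the data in the unsupervised.py notebook. However, almost every time I start it, my computer freezes and the kernel appears to die. It looks like a memory problem. Specifically, it happens while executing step 3 of the following function: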
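>>> import numpy as np
>>> from textblob import TextBlob
>>> train.shape
(130318, 4)
>>> len(dict_emb)
179862
>>> def process_data(train):
...     print("step 1")
...     # split each context into a list of raw sentence strings
...     train['sentences'] = train['context'].apply(lambda x: [item.raw for item in TextBlob(x).sentences])
...     print("step 2")
...     train["target"] = train.apply(get_target, axis = 1)
...     print("step 3")
...     # look up each sentence's embedding; fall back to a zero vector when it is missing
...     train['sent_emb'] = train['sentences'].apply(
...         lambda x: [dict_emb[item][0]
...                    if item in dict_emb
...                    else np.zeros(4096) for item in x])
...     return train
...
>>> train = process_data(train)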
Maybe it is a memory problem? Is there an online solution? For now, I will try Google Colaboratory ...
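To gauge whether memory is a plausible cause, here is a rough back-of-envelope sketch. It assumes every embedding is a 4096-element vector, as `np.zeros(4096)` suggests, and that the elements are float64 (the default dtype of `np.zeros`); the real dtype of the vectors in `dict_emb` is not known here. The `estimate_bytes` helper and the average sentence counts are hypothetical, made up only for illustration.

```python
import numpy as np

def estimate_bytes(n_vectors, dim=4096, dtype=np.float64):
    """Bytes needed to hold n_vectors separate vectors of length dim."""
    return n_vectors * dim * np.dtype(dtype).itemsize

GIB = 2 ** 30
n_rows = 130318   # train.shape[0] from the question
n_dict = 179862   # len(dict_emb) from the question

# Memory taken by the vectors in dict_emb alone, under the float64 assumption.
print("dict_emb: %.2f GiB" % (estimate_bytes(n_dict) / GIB))

# Upper bound for the new column if every sentence got its own vector.
# Lookups that hit dict_emb reuse the existing array, so only the misses
# (each np.zeros(4096) call) allocate new memory.
# avg_sentences is a hypothetical value; the real average is not known here.
for avg_sentences in (1, 5, 10):
    total = estimate_bytes(n_rows * avg_sentences)
    print("avg %2d sentences/row: %.2f GiB" % (avg_sentences, total / GIB))
```

Under these assumptions a single vector takes 4096 × 8 = 32768 bytes, so every sentence missing from `dict_emb` costs a fresh 32 KiB in step 3.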
Maybe step 3 could be turned into a loop that processes the rows in batches? My attempt:
for i in range(0,len(train.shape[0]-200,200)):
    print(i)
    train['sent_emb'] = train['sentences'].iloc[i,i+200].apply(
        lambda x: [dict_emb[item][0]
                   if item in dict_emb
                   else np.zeros(4096) for item in x])
But this gives me several errors:
step 1
step 2
step 3
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-d3e879a8c753> in <module>()
----> 1 train = process_data(train)
<ipython-input-25-7063894d5c9a> in process_data(train)
10 #train['sent_emb'] = train['sentences'].apply(lambda x: [dict_emb[item][0] if item in\
11 # dict_emb else np.zeros(4096) for item in x])
---> 12 train['quest_emb'] =[]
13 for i in range(0,len(train.shape[0]-200,200)):
14 print(i)
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
3117 else:
3118 # set column
-> 3119 self._set_item(key, value)
3120
3121 def _setitem_slice(self, key, value):
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in _set_item(self, key, value)
3192
3193 self._ensure_valid_index(value)
-> 3194 value = self._sanitize_column(key, value)
3195 NDFrame._set_item(self, key, value)
3196
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/frame.py in _sanitize_column(self, key, value, broadcast)
3389
3390 # turn me into an ndarray
-> 3391 value = _sanitize_index(value, self.index, copy=False)
3392 if not isinstance(value, (np.ndarray, Index)):
3393 if isinstance(value, list) and len(value) > 0:
~/Documents/programming/mybot/mybotenv/lib/python3.5/site-packages/pandas/core/series.py in _sanitize_index(data, index, copy)
3999
4000 if len(data) != len(index):
-> 4001 raise ValueError('Length of values does not match length of ' 'index')
4002
4003 if isinstance(data, ABCIndexClass) and not copy:
ValueError: Length of values does not match length of index
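For reference, a minimal sketch of the batched loop that the attempt above seems to be aiming at. It assumes `dict_emb` maps a sentence string to an array whose first row is the embedding, as `dict_emb[item][0]` suggests. The function name `embed_in_chunks` and the toy data are hypothetical. The results are collected in a plain Python list and attached to the frame once at the end, because assigning an empty list as a column (`train['quest_emb'] = []`) is what raises the `ValueError` in the traceback: the list has length 0 while the index has 130318 entries.

```python
import numpy as np
import pandas as pd

def embed_in_chunks(sentences, dict_emb, dim=4096, chunk_size=200):
    """Look up embeddings chunk by chunk; returns a Series aligned with `sentences`."""
    zero = np.zeros(dim)  # one shared fallback vector instead of a new one per miss
    parts = []
    # range(start, stop, step) over the row count; iloc takes a slice, not a tuple
    for start in range(0, len(sentences), chunk_size):
        chunk = sentences.iloc[start:start + chunk_size]
        parts.append(chunk.apply(
            lambda sents: [dict_emb[s][0] if s in dict_emb else zero for s in sents]
        ))
    if not parts:  # empty input: pd.concat would fail on an empty list
        return pd.Series(dtype=object)
    return pd.concat(parts)

# Toy data (hypothetical) with 4-dimensional vectors so the sketch runs quickly.
toy_emb = {"a b.": np.ones((1, 4)), "c d.": np.full((1, 4), 2.0)}
toy = pd.DataFrame({"sentences": [["a b.", "zzz"], ["c d."], []]})
toy["sent_emb"] = embed_in_chunks(toy["sentences"], toy_emb, dim=4, chunk_size=2)
print(toy["sent_emb"])
```

Batching alone does not shrink the final column, since all results still end up in memory together. Sharing one zero vector does avoid allocating a separate array for every sentence that is missing from `dict_emb`, at the price that those entries all point to the same array and must not be modified in place.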