Question

I have 2.2 million data samples to classify into more than 7,500 categories. I am using pandas and scikit-learn in Python to do this.

Below is the code that loads and preprocesses my dataset:

    import pandas as pd
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer

    dataset = pd.read_csv("trainset.csv", encoding="ISO-8859-1", low_memory=False)
    # Replace everything except letters with spaces, then lowercase.
    dataset['description'] = dataset['description'].str.replace(r'[^a-zA-Z]', ' ', regex=True)
    dataset['description'] = dataset['description'].str.replace(r'[\d]', ' ', regex=True)
    dataset['description'] = dataset['description'].str.lower()

    stop = stopwords.words('english')
    lemmatizer = WordNetLemmatizer()

    # Remove English stop words, collapse repeated whitespace, then tokenize.
    dataset['description'] = dataset['description'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ', regex=True)
    dataset['description'] = dataset['description'].str.replace(r'\s\s+', ' ', regex=True)
    dataset['description'] = dataset['description'].apply(word_tokenize)
    ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
    POS_LIST = [NOUN, VERB, ADJ, ADV]
    # Lemmatize every token once per part of speech, dropping duplicate tokens.
    for tag in POS_LIST:
        dataset['description'] = dataset['description'].apply(
            lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
    dataset['description'] = dataset['description'].apply(lambda x: " ".join(x))

    # Bag-of-words counts; min_df=0.0005 ignores terms found in fewer than 0.05% of documents.
    countvec = CountVectorizer(min_df=0.0005)
    documenttermmatrix = countvec.fit_transform(dataset['description'])
    column = countvec.get_feature_names()

    y_train = dataset['category'].tolist()

    del dataset
    del stop
    del tag

These are the steps I followed:

1. Preprocessing
2. Vector representation
3. Training

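    from xgboost import XGBClassifier
    # Gradient-boosted trees with a softmax objective for the multiclass labels.
    model = XGBClassifier(silent=False, n_estimators=500, objective='multi:softmax', subsample=0.8)
    model.fit(documenttermmatrix, y_train, verbose=True)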

The resulting documenttermmatrix is a SciPy CSR matrix with 12k features and 2.2 million samples.
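To put that matrix size in perspective, here is a rough footprint estimate. The row and column counts come from the description above; the average of 20 non-zero terms per description, the 8-byte values, and the 4-byte indices are assumptions chosen for illustration, not measurements.

```python
# Rough memory estimate for a CSR matrix versus its dense equivalent.
n_samples = 2_200_000      # from the question
n_features = 12_000        # from the question
avg_nnz_per_row = 20       # ASSUMPTION: hypothetical average number of terms per description
value_bytes = 8            # ASSUMPTION: 64-bit stored counts
index_bytes = 4            # ASSUMPTION: 32-bit column indices and row pointers

nnz = n_samples * avg_nnz_per_row
# CSR keeps one value and one column index per non-zero, plus n_samples + 1 row pointers.
csr_bytes = nnz * (value_bytes + index_bytes) + (n_samples + 1) * index_bytes
dense_bytes = n_samples * n_features * value_bytes

GIB = 1024 ** 3
print(f"CSR:   {csr_bytes / GIB:.2f} GiB")
print(f"Dense: {dense_bytes / GIB:.2f} GiB")
```

Under these assumptions the sparse form needs about half a GiB, while a dense copy of the same matrix would need close to 200 GiB, so any step that converts the matrix to dense form would not fit in 128 GB.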

For training I tried XGBoost through its scikit-learn interface (the XGBClassifier call above). After 2-3 minutes of running that code, I got this error:

    OSError: [WinError 541541187] Windows Error 0x20474343

I also tried scikit-learn's Naive Bayes, which gave me a memory error.
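One possible reason a memory error can appear even with sparse input is that a classifier may build dense per-class arrays internally, such as one-hot labels or per-class scores. The arithmetic below is only a back-of-envelope estimate using the sample and class counts from the question; whether a particular library version actually allocates an array of this shape is an assumption, not something the question shows.

```python
# Size of a dense (n_samples x n_classes) array, e.g. one-hot labels or per-class scores.
n_samples = 2_200_000   # from the question
n_classes = 7_500       # from the question ("more than 7,500")

def dense_gib(rows, cols, bytes_per_item):
    """Return the size in GiB of a dense rows x cols array."""
    return rows * cols * bytes_per_item / 1024 ** 3

float32_gib = dense_gib(n_samples, n_classes, 4)
float64_gib = dense_gib(n_samples, n_classes, 8)
print(f"float32: {float32_gib:.1f} GiB, float64: {float64_gib:.1f} GiB")
```

A single array of this shape comes to roughly 61 GiB with 32-bit items and roughly 123 GiB with 64-bit items, which is comparable to the machine's entire 128 GB of RAM.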

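Problem

I used a SciPy sparse matrix, which takes very little memory, and I delete all unused objects before running XGBoost or Naive Bayes. The system I am using has 128 GB of RAM, but I still run into memory problems during training.

I am new to Python. Is there anything wrong with my code? Can anyone tell me how to use memory efficiently and proceed further?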
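One general way to limit peak memory is to train incrementally on slices of the matrix. The sketch below is a swapped-in technique, not something taken from the question: it uses the partial_fit method of scikit-learn's MultinomialNB. The helper name fit_in_batches, the batch size, and the synthetic data are all hypothetical, and whether this avoids the memory error on the real data is untested.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

def fit_in_batches(X, y, classes, batch_size=1000):
    """Train MultinomialNB incrementally on row slices of a CSR matrix."""
    model = MultinomialNB()
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        # Every call receives the full class list, since a batch may miss some labels.
        model.partial_fit(X[start:stop], y[start:stop], classes=classes)
    return model

# Synthetic stand-in data (hypothetical sizes, far smaller than the real problem).
rng = np.random.default_rng(0)
X_demo = sparse.random(5000, 200, density=0.05, format="csr", random_state=0)
y_demo = rng.integers(0, 10, size=5000)

model = fit_in_batches(X_demo, y_demo, classes=np.arange(10))
predictions = model.predict(X_demo[:5])
print(predictions.shape)
```

With this pattern, any per-class working arrays are sized by the batch rather than by all 2.2 million rows, so the batch size becomes the knob that trades speed against memory.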

Answer 1

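I think I can explain the problem in your code. The OS error appears to be:

"ERROR_DS_RIDMGR_DISABLED
8263 (0x2047)

The directory service detected that the subsystem that allocates relative identifiers is disabled. This can occur as a protective mechanism when the system determines a significant portion of relative identifiers (RIDs) have been exhausted."

via https://msdn.microsoft.com/en-us/library/windows/desktop/ms681390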

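I think you exhausted a significant portion of the RIDs at this step in your code:

    dataset['description'] = dataset['description'].apply(
        lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))

You are passing a lemmatizer into your lambda, but lambdas are anonymous, so it looks like you might be making 2.2 million copies of that lemmatizer at runtime.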
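One way to test this suspicion is a small experiment with a stand-in object. CountingLemmatizer below is hypothetical (it is not the NLTK class); it counts how many instances are created and how many calls reach the original instance when a lambda like the one above is applied to several rows.

```python
class CountingLemmatizer:
    """Hypothetical stand-in for WordNetLemmatizer that tracks instances and calls."""
    instances_created = 0

    def __init__(self):
        CountingLemmatizer.instances_created += 1
        self.calls = 0

    def lemmatize(self, word, pos="n"):
        self.calls += 1
        return word.lower()

lemmatizer = CountingLemmatizer()
tag = "n"
rows = [["Cats", "cats"], ["Dogs"], ["Birds", "BIRDS"]]

# Same shape as the lambda in the question; sorted() only makes the output deterministic.
process = lambda x: sorted(set(lemmatizer.lemmatize(item, tag) for item in x))
results = [process(row) for row in rows]

print(CountingLemmatizer.instances_created, lemmatizer.calls, results)
```

In this sketch every call lands on the single original instance, which suggests the lambda holds a reference to the lemmatizer rather than a copy. It does not by itself rule out other memory costs in the apply step.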

Whenever you run into memory issues, you should try changing the low_memory flag to True.

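Response to comment -

I checked the Pandas documentation, and you can define a function outside of dataset['description'].apply() and then reference that function in the call to dataset['description'].apply(). Here is how I would write that function.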

    def lemmatize_descriptions(x):
        # Relies on the global `lemmatizer` and `tag` from the question's code.
        return list(set([lemmatizer.lemmatize(item, tag) for item in x]))

Then, the call to apply() would be -

    dataset['description'] = dataset['description'].apply(lemmatize_descriptions)
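The function above reads tag from the enclosing scope, so it only works inside the loop over POS_LIST. A minimal variant that takes the tag as an explicit argument might look like the sketch below. StubLemmatizer is a hypothetical stand-in so the example runs without NLTK data, and plain lists stand in for the pandas column.

```python
from functools import partial

class StubLemmatizer:
    """Hypothetical stand-in for WordNetLemmatizer; strips a trailing 's' from nouns."""
    def lemmatize(self, word, pos="n"):
        if pos == "n" and word.endswith("s"):
            return word[:-1]
        return word

lemmatizer = StubLemmatizer()

def lemmatize_descriptions(tokens, tag):
    """Lemmatize tokens for one part-of-speech tag and drop duplicates."""
    # sorted() is used instead of list() so the result order is deterministic.
    return sorted(set(lemmatizer.lemmatize(item, tag) for item in tokens))

descriptions = [["cats", "cat", "runs"], ["dogs"]]
for tag in ["n", "v"]:
    step = partial(lemmatize_descriptions, tag=tag)
    descriptions = [step(tokens) for tokens in descriptions]

print(descriptions)
```

With pandas, the same function can be passed to apply together with the extra keyword argument, for example dataset['description'].apply(lemmatize_descriptions, tag=tag).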

Here is the documentation.

Text classification for large datasets in Python
