Question

I have Spanish text that I need to classify, and I am using Python. I am using the code below for Spanish text classification, but the accuracy is low.

# Preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk import word_tokenize

cls_txt = 'some spanish text'
# Note: this pattern also strips accented letters and ñ (á, é, í, ó, ú, ü, ñ)
cls_txt = re.sub('[^a-zA-Z]', ' ', cls_txt)
cls_txt = cls_txt.lower()
cls_txt = cls_txt.split()
spanish_stopwords = set(stopwords.words('spanish'))  # build the set once, not per word
cls_txt = [word for word in cls_txt if word not in spanish_stopwords]
cls_txt = ' '.join(cls_txt)
stemmer = SnowballStemmer('spanish')
cls_txt = [stemmer.stem(word) for word in word_tokenize(cls_txt)]
cls_txt = ' '.join(cls_txt)
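
One thing worth checking in the preprocessing step: the pattern `[^a-zA-Z]` treats accented vowels and `ñ` as non-letters, so a word like `niño` becomes `ni o`. The sketch below is one possible variant that keeps those characters. The function name `preprocess`, the `stem` parameter, and the small `demo_stop_words` set are assumptions made for illustration and are not part of the question; in practice the NLTK Spanish stop-word list and `SnowballStemmer('spanish').stem` could be passed in instead.

```python
import re

# Keep Spanish letters: a-z plus accented vowels, ü and ñ.
# Everything else (digits, punctuation, symbols) becomes a space.
_NON_LETTERS = re.compile(r"[^a-záéíóúüñ]+")


def preprocess(text, stop_words=frozenset(), stem=None):
    """Lowercase, drop non-letters, remove stop words, optionally stem."""
    tokens = _NON_LETTERS.sub(" ", text.lower()).split()
    tokens = [t for t in tokens if t not in stop_words]
    if stem is not None:
        # `stem` is any callable mapping a word to its stem (hypothetical hook).
        tokens = [stem(t) for t in tokens]
    return " ".join(tokens)


# Tiny hand-written stop-word set, for illustration only.
demo_stop_words = frozenset({"el", "la", "de", "en", "y"})
cleaned = preprocess("El niño comió en la montaña.", demo_stop_words)
print(cleaned)
```

Lowercasing before applying the pattern means uppercase accented letters such as `Á` are kept as well, since `str.lower()` maps them to their lowercase forms first.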

# Classification
# X (feature matrix) and y (labels) are built elsewhere; that step is not shown here.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
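
The classification code does not show how the cleaned text is turned into the numeric matrix `X`. A minimal sketch of one common setup follows, assuming TF-IDF features and `MultinomialNB` swapped in for `GaussianNB`; multinomial Naive Bayes accepts the sparse matrix that `TfidfVectorizer` produces, whereas `GaussianNB` expects a dense array. The `texts` and `labels` values are toy data invented for illustration, not data from the question, and whether this improves accuracy on the real dataset is untested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration; replace with the real corpus and labels.
texts = [
    "el equipo ganó el partido de fútbol",
    "el jugador marcó un gol en el partido",
    "el gobierno aprobó la nueva ley",
    "el presidente firmó la ley en el congreso",
]
labels = ["deportes", "deportes", "politica", "politica"]

# The pipeline vectorizes raw strings and then fits the classifier,
# so the same transformation is applied at training and prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["el jugador ganó el partido"])
print(pred)
```

With a real dataset, the same `model` object can be fitted on the training split from `train_test_split` and scored on the held-out split with `model.score(X_test, y_test)`, where `X_test` holds raw or preprocessed strings rather than a precomputed matrix.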

Is this the right approach? Please suggest better code and a better approach for Spanish text classification. Thanks in advance.

Spanish text classification

0 answers: