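I have Spanish text that I need to classify. I am using Python, and I am using the following code for Spanish text classification, but it gives low accuracy.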
# Pre-processing: I am using the code below.
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk import word_tokenize
cls_txt = 'some spanish text'
# Replace every character outside a-z/A-Z with a space
# (note: this also removes accented letters such as á, é, ñ)
cls_txt = re.sub('[^a-zA-Z]', ' ', cls_txt)
cls_txt = cls_txt.lower()
cls_txt = cls_txt.split()
# Build the stopword set once instead of once per word
spanish_stopwords = set(stopwords.words('spanish'))
cls_txt = [word for word in cls_txt if word not in spanish_stopwords]
cls_txt = ' '.join(cls_txt)
# Stem each token with the Spanish Snowball stemmer
stemmer = SnowballStemmer('spanish')
cls_txt = [stemmer.stem(word) for word in word_tokenize(cls_txt)]
cls_txt = ' '.join(cls_txt)
# Classification: I am using the code below.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Is this the right approach? Please suggest better code and a better method for Spanish text classification. Thanks in advance.
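One thing that may be worth checking in the pre-processing step is the pattern '[^a-zA-Z]', which also replaces accented Spanish letters with spaces and so splits words apart. The sketch below compares it with a Unicode-aware tokenizer that uses only the standard library. The names `tokenize_spanish`, `WORD_RE`, and `EXAMPLE_STOPWORDS` are hypothetical, and the stopword set is a deliberately tiny illustration, not a real Spanish stopword list. It also assumes the input is in composed (NFC) form, which is not guaranteed for every data source.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# in practice nltk's stopwords.words('spanish') could be passed in instead.
EXAMPLE_STOPWORDS = {"el", "la", "los", "las", "de", "en", "y", "que", "es", "un", "una"}

# [^\W\d_]+ matches runs of Unicode letters, so á, é, ñ and ü are kept.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize_spanish(text, stopwords=EXAMPLE_STOPWORDS):
    """Lowercase, keep only letter runs (accented ones included), drop stopwords."""
    tokens = WORD_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in stopwords]


sample = 'El niño come en la montaña'

# The ASCII-only pattern from the question breaks words at accented letters.
original_pattern_result = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print(original_pattern_result)
# ['el', 'ni', 'o', 'come', 'en', 'la', 'monta', 'a']

# The Unicode-aware tokenizer keeps the words whole.
print(tokenize_spanish(sample))
# ['niño', 'come', 'montaña']
```

The resulting tokens could then be passed to the Snowball stemmer exactly as in the code above. Whether this change improves accuracy depends on the data, so it is best treated as something to try rather than a guaranteed fix.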