Recently I started reading more about NLP to learn more about the topic. I am now running into a problem while putting together my own classification algorithm (texts carrying a positive/negative message), related to the training and test datasets. I want to use a pipeline, mainly because I also want to take into account the number of negative words that appear in each text. I have two datasets, and my current approach is to put all the texts from both datasets together (after preprocessing), split that corpus into a training set and a test set, and then merge them back together.
The reason I do it this way (although I would rather use the two datasets separately)
is that if I create X_train and X_test (and correspondingly y_train, y_test) directly from the two files, without using the split function:
import pandas as pd

# Load the two datasets (tab-separated, quoting disabled)
datasetTrain = pd.read_csv('train.tsv', delimiter='\t', quoting=3)
datasetTrain['PN'].value_counts()
datasetTest = pd.read_csv('test.tsv', delimiter='\t', quoting=3)
datasetTest['PN'].value_counts()

corpus = []
y = []
# some preprocessing
y.append(posNeg)
corpus.append(text)
# ...
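For context, the FeatureUnion below selects columns by name, so X_train and X_test need to be DataFrames rather than plain lists. A minimal sketch of assembling one (the column names 'processed', 'posWords' and 'offWords' are taken from the selectors in the pipeline below; the toy values are made up):

```python
import pandas as pd

# Toy preprocessed texts and per-text word counts (made-up values)
corpus = ["good movie", "terrible plot", "not bad at all"]
pos_counts = [1, 0, 1]   # e.g. number of positive words per text
off_counts = [0, 1, 0]   # e.g. number of offensive words per text

# Assemble the DataFrame that the column selectors expect
X = pd.DataFrame({
    'processed': corpus,
    'posWords': pos_counts,
    'offWords': off_counts,
})
print(list(X.columns))  # ['processed', 'posWords', 'offWords']
```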
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    """Select a single text column from a DataFrame (returns a Series)."""
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.field]

class NumberSelector(BaseEstimator, TransformerMixin):
    """Select a numeric column from a DataFrame (returns a one-column frame)."""
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[[self.field]]
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('processed')),
            ('cv', CountVectorizer(stop_words=stopwords, ngram_range=(1, 1),
                                   min_df=5, max_df=0.65)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('negative', Pipeline([
            ('numbers', NumberSelector('posWords')),
            ('wscaler', StandardScaler()),
        ])),
        ('offensive', Pipeline([
            ('numbers', NumberSelector('offWords')),
            ('wscaler', StandardScaler()),
        ])),
    ])),
    ('clasif', RandomForestClassifier()),
])
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
When I run the classification algorithm, I get an error. This is the split I use:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # cross_validation is deprecated
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.11, random_state=0)
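If the goal is to keep the two files separate, there is no need to concatenate and re-split: the pipeline is fitted on the training frame only, and the test frame is passed to predict unchanged. A self-contained sketch with in-memory stand-ins for the two TSV files (toy rows; in practice these would come from pd.read_csv as above, and 'PN' is the label column from the value_counts() calls):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

class TextSelector(BaseEstimator, TransformerMixin):
    """Select a single text column from a DataFrame (returns a Series)."""
    def __init__(self, field):
        self.field = field
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.field]

# Stand-ins for datasetTrain / datasetTest (toy rows)
train = pd.DataFrame({'processed': ['good film', 'awful film', 'great plot', 'bad plot'],
                      'PN': [1, 0, 1, 0]})
test = pd.DataFrame({'processed': ['good plot', 'awful mess'],
                     'PN': [1, 0]})

clf = Pipeline([
    ('colext', TextSelector('processed')),
    ('tfidf', TfidfVectorizer()),
    ('clasif', RandomForestClassifier(random_state=0)),
])
clf.fit(train, train['PN'])   # fit only on the training file
pred = clf.predict(test)      # evaluate on the held-out file
print(len(pred))              # one prediction per test row
```

The key point is that fit learns the vocabulary and scaling from the training data alone, so the test file never needs to be mixed into the corpus.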
If I change TfidfTransformer() in the pipeline to TfidfVectorizer(), I get this error:
classifier.fit(X_train)
X_train = classifier.transform(X_train)
X_test = classifier.transform(X_test)
pred = classifier.predict(X_test)
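For what it's worth, one likely reason the snippet above fails: Pipeline.fit requires y when the final step is a classifier, and a Pipeline ending in a classifier does not expose transform. A minimal sketch (toy data and a LogisticRegression stand-in, not the original pipeline) of fitting with y and pulling transformed features from an intermediate step instead:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

X = pd.Series(['good film', 'awful film', 'great plot', 'bad plot'])
y = [1, 0, 1, 0]

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clasif', LogisticRegression()),
])
clf.fit(X, y)  # y is required: the last step is a classifier, not a transformer

# To inspect the transformed features, call the fitted intermediate step directly
features = clf.named_steps['tfidf'].transform(X)
print(features.shape)  # one row per sample
```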
I am quite new to this, so I was wondering whether someone could point me in the right direction?