Pipeline and two different datasets for text classification in Python

Time: 2018-11-27 12:39:32

Tags: python nlp pipeline text-classification tfidfvectorizer

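I recently started reading more about NLP to learn about the subject. I am now trying to build my own classification algorithm (does a text convey a positive or a negative message?), and the problem I have run into concerns the training and test datasets. I want to use a pipeline, mainly because I also want to take into account the number of negative words that appear in each text. I have two datasets, and my approach is to put all the texts from both datasets together (after preprocessing) and then split the combined corpus into a test set and a training set.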
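To make the "number of negative words" feature concrete, here is a minimal sketch of how such a count could be computed per text. The word list `NEGATIVE_WORDS` and the helper `count_negative_words` are hypothetical names used only for illustration; the real lexicon and tokenization are not shown in this question.

```python
# Hypothetical lexicon; the real word list is not part of the question.
NEGATIVE_WORDS = {"bad", "awful", "terrible", "not"}


def count_negative_words(text, lexicon=NEGATIVE_WORDS):
    """Count whitespace-separated tokens of `text` found in `lexicon`."""
    return sum(1 for token in text.lower().split() if token in lexicon)


# One count per text; such a column can sit next to the text column.
counts = [count_negative_words(t) for t in ["Not a bad film", "Great plot"]]
```

A numeric column built this way can be passed through a column selector and a scaler, alongside the text features.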

The reason I do this (although I would rather use the two datasets separately) is that if I create X_train and X_test (and y_train, y_test respectively) from the start, without using the split function:

import pandas as pd

# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
datasetTrain = pd.read_csv('train.tsv', delimiter='\t', quoting=3)
datasetTrain['PN'].value_counts()

datasetTest = pd.read_csv('test.tsv', delimiter='\t', quoting=3)
datasetTest['PN'].value_counts()

corpus = []
y = []

# some preprocessing
    y.append(posNeg)
    corpus.append(text)
#...

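from sklearn.base import BaseEstimator, TransformerMixin


class TextSelector(BaseEstimator, TransformerMixin):
    """Select one text column as a 1-D Series, as vectorizers expect."""
    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.field]


class NumberSelector(BaseEstimator, TransformerMixin):
    """Select one numeric column as a 2-D DataFrame, as scalers expect."""
    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.field]]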

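from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# `stopwords` is expected to be defined earlier (not shown here)
classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('colext', TextSelector('processed')),
            ('cv', CountVectorizer(stop_words=stopwords, ngram_range=(1, 1), min_df=5, max_df=0.65)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('negative', Pipeline([
            ('numbers', NumberSelector('posWords')),
            ('wscaler', StandardScaler()),
        ])),
        ('offensive', Pipeline([
            ('numbers', NumberSelector('offWords')),
            ('wscaler', StandardScaler()),
        ])),
    ])),
    ('clasif', RandomForestClassifier()),
])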

classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)

When I run the classification algorithm, I get an error:

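# Split the dataset into the training set and the test set.
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.11, random_state=0)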

If I replace TfidfTransformer() with TfidfVectorizer() in the pipeline, I get this error:

classifier.fit(X_train)
X_train = classifier.transform(X_train)
X_test = classifier.transform(X_test)

pred = classifier.predict(X_test)
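One way the two datasets could be kept separate is sketched below. It is an illustration under assumptions, not a confirmed fix for the errors in this question: the toy DataFrames stand in for train.tsv and test.tsv, and the `ColumnSelector` class and the `negWords` column are hypothetical names. The relevant scikit-learn behavior is that a Pipeline whose last step is a classifier is fitted once with both the features and the labels, then applied to new data with `predict`. `RandomForestClassifier` has no `transform` method, so there is no separate transform step for the full pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick one column; as_frame=True keeps it 2-D."""

    def __init__(self, field, as_frame=False):
        self.field = field
        self.as_frame = as_frame

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.field]] if self.as_frame else X[self.field]


# Toy data standing in for the two separate files (assumed layout).
train = pd.DataFrame({
    "processed": ["good film", "bad film", "great plot", "awful plot"],
    "negWords": [0, 1, 0, 1],
    "PN": [1, 0, 1, 0],
})
test = pd.DataFrame({
    "processed": ["good plot", "awful film"],
    "negWords": [0, 1],
    "PN": [1, 0],
})

classifier = Pipeline([
    ("features", FeatureUnion([
        ("text", Pipeline([
            ("colext", ColumnSelector("processed")),
            ("tfidf", TfidfVectorizer()),
        ])),
        ("negative", Pipeline([
            ("numbers", ColumnSelector("negWords", as_frame=True)),
            ("wscaler", StandardScaler()),
        ])),
    ])),
    ("clasif", RandomForestClassifier(n_estimators=10, random_state=0)),
])

feature_cols = ["processed", "negWords"]
# Fit on the training DataFrame only; the vocabulary and scaling
# statistics learned here are reused unchanged on the test DataFrame.
classifier.fit(train[feature_cols], train["PN"])
pred = classifier.predict(test[feature_cols])
```

Both inputs are DataFrames with the same column names, which is what the selector steps rely on. Passing a plain list of texts instead would make the column lookup fail.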

I am new to this, and I would appreciate it if someone could point me in the right direction.

0 answers