Machine learning: predicting a second dataset with a classifier trained on the first

Asked: 2019-12-28 07:22:27

Tags: python numpy machine-learning svm reshape

I am new to machine learning and tried to follow this question, but it is not clear to me. I have been stuck on this for two months, so please help me resolve the error.

What I am actually trying to do is:

  1. "Train an SVM classifier" on TRAIN_features and TRAIN_labels, which are extracted from TRAIN_dataset with shape (98962,) and size 98962.
  2. "Test the SVM classifier" on TEST_features, which are extracted from another dataset, TEST_dataset, that has the same shape (98962,) and size 98962 but different values from TRAIN_dataset.

After preprocessing TRAIN_features and TEST_features, I vectorized both of them with the help of TfidfVectorizer. After that I checked the shape and size of both feature matrices again, i.e.:

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)

The size of processed_TRAIN_features becomes 1032665 and its shape becomes (98962, 9434).

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

The size of processed_TEST_features becomes 1457961 and its shape becomes (98962, 10782).
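
The column count of each matrix equals the size of the vocabulary that its own TfidfVectorizer learned during fit_transform, which is why the two matrices end up with 9434 and 10782 columns. A minimal sketch of that behaviour, using made-up toy sentences rather than the actual datasets:

from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy corpora (illustrative only)
train_docs = ["the cat sat on the mat", "the dog sat on the log"]
test_docs = ["a bird flew over the lazy dog near the river"]

# Each vectorizer learns its own vocabulary, so the column counts differ
vec_train = TfidfVectorizer()
vec_test = TfidfVectorizer()
X_train = vec_train.fit_transform(train_docs)
X_test = vec_test.fit_transform(test_docs)

print(X_train.shape[1], len(vec_train.vocabulary_))  # columns == train vocabulary size
print(X_test.shape[1], len(vec_test.vocabulary_))    # a different vocabulary size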

I know that when I train the SVM classifier on processed_TRAIN_features and then predict processed_TEST_features with the same classifier, it throws an error because the shape and size of the two feature matrices are different.
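
The error itself comes from scikit-learn checking that the number of features seen at predict time matches the number seen at fit time. A small self-contained reproduction with random toy data (not the real features):

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
clf = svm.SVC(kernel='linear')
clf.fit(rng.rand(10, 4), [0, 1] * 5)   # trained on 4 features

try:
    clf.predict(rng.rand(3, 6))        # predicting on 6 features
except ValueError as exc:
    print(exc)                         # sklearn rejects the mismatched feature count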

In my opinion, the only solution to this problem is to reshape the sparse matrix (numpy.float64) processed_TEST_features to processed_TRAIN_features... I think reshaping processed_TEST_features into processed_TRAIN_features is only possible if its size were smaller, or is there some other way to achieve points (1, 2) above? I was unable to apply this question to my problem, but I am still looking for how processed_TEST_features can become equal to processed_TRAIN_features with respect to shape and size.

Could anyone please help me with this... Thanks in advance.

The complete code is given below:

import re
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

DataPath2     = ".../train.csv"
TRAIN_dataset =   pd.read_csv(DataPath2)

DataPath1     = "..../completeDATAset.csv"
TEST_dataset  =   pd.read_csv(DataPath1)

TRAIN_features = TRAIN_dataset.iloc[:, 1 ].values
TRAIN_labels = TRAIN_dataset.iloc[:,0].values

TEST_features = TEST_dataset.iloc[:, 1 ].values
TEST_labeels = TEST_dataset.iloc[:,0].values
lab_enc = preprocessing.LabelEncoder()
TEST_labels = lab_enc.fit_transform(TEST_labeels)

processed_TRAIN_features = []

for sentence in range(0, len(TRAIN_features)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(TRAIN_features[sentence]))

    # remove all single characters
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    #remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature)

    # remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature)

    # remove special symbols
    processed_feature = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature)

    # Remove single characters from the start
    processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature)

    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)

    #remove links
    processed_feature = re.sub(r"http\S+", "", processed_feature)

    # Removing prefixed 'b'
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    #removing rt
    processed_feature = re.sub(r'^rt\s+', '', processed_feature)

    # Converting to Lowercase
    processed_feature = processed_feature.lower()

    processed_TRAIN_features.append(processed_feature)

vectorizer = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TRAIN_features = vectorizer.fit_transform(processed_TRAIN_features)


processed_TEST_features = []

for sentence in range(0, len(TEST_features)):
    # Remove all the special characters
    processed_feature1 = re.sub(r'\W', ' ', str(TEST_features[sentence]))

    # remove all single characters
    processed_feature1 = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature1)

    #remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 xa6]\s+', ' ', processed_feature1)

    # remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x98]\s+', ' ', processed_feature1)

    # remove special symbols
    processed_feature1 = re.sub(r'\s+[xe2 x80 x99]\s+', ' ', processed_feature1)

    # Remove single characters from the start
    processed_feature1 = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature1)

    # Substituting multiple spaces with single space
    processed_feature1 = re.sub(r'\s+', ' ', processed_feature1, flags=re.I)

    #remove links
    processed_feature1 = re.sub(r"http\S+", "", processed_feature1)

    # Removing prefixed 'b'
    processed_feature1 = re.sub(r'^b\s+', '', processed_feature1)

    #removing rt
    processed_feature1 = re.sub(r'^rt\s+', '', processed_feature1)

    # Converting to Lowercase
    processed_feature1 = processed_feature1.lower()

    processed_TEST_features.append(processed_feature1)

vectorizer1 = TfidfVectorizer(min_df=7, max_df=0.8, sublinear_tf = True, use_idf=True)
processed_TEST_features = vectorizer1.fit_transform(processed_TEST_features)

X_train_data, X_test_data, y_train_data, y_test_data = train_test_split(processed_TRAIN_features, TRAIN_labels, test_size=0.3, random_state=0)

text_classifier = svm.SVC(kernel='linear', class_weight="balanced" ,probability=True ,C=1 , random_state=0)

text_classifier.fit(X_train_data, y_train_data)

text_classifier.predict(processed_TEST_features)


1 answer:

Answer 0 (score: 0):

processed_TRAIN_features = csr_matrix((processed_TRAIN_features),shape=(new row length,new column length))
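
For what it's worth, here is a sketch of one way such a reshape could be written with scipy.sparse, rebuilding the matrix from its (data, indices, indptr) triplet so that it gains extra empty columns. The helper name pad_columns is purely illustrative (not from the answer), and this only makes the shapes line up; it does not map the two vocabularies onto the same terms:

from scipy.sparse import csr_matrix

def pad_columns(X, n_cols):
    # Assumes n_cols >= X.shape[1]; the added columns are simply empty
    return csr_matrix((X.data, X.indices, X.indptr), shape=(X.shape[0], n_cols))

# e.g. widen the train matrix to the test matrix's column count
# processed_TRAIN_features = pad_columns(processed_TRAIN_features, processed_TEST_features.shape[1])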