Question

I have the following dataset, myDataSet:

sentence1    sentence2    lengthofsentence1   lengthOfsentence2    label
Thank you     Thanks             9                    6              1
Hello           Hi               5                    2              2
Goodbye         Bye              7                    3              3
Many thanks   Thanks to you      11                   13             1
    .            .               .                    .              .
    .            .               .                    .              .
    .            .               .                    .              .

I want to classify it with an SVM. I can build my training set from columns 3 and 4:

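library(e1071)  # provides svm()
train_data <- myDataSet[3:4]
labels <- factor(myDataSet[[5]])  # svm() expects a vector or factor of labels, not a data frame
train <- svm(train_data, labels, type = "C-classification")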

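But how can I build the training set from the first four columns? That is, I want to use columns 1, 2, 3 and 4, two of which are text and two of which are numeric. I read this page: SVM Tutorial: How to classify text in R, but it only covers columns of type text.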

Answer 1

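There are many ways to do this. Since the SVM needs numeric values, you can recode the text columns to numbers and then train the SVM on the result. Remember to keep the mapping consistent, so that the same text always gets the same code.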
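
One way to read "keep the mapping consistent" is to build a single lookup table and reuse it for every text column. The sketch below uses base R only; the toy data frame and the names dictionary, encode and decode are made up for illustration and are not part of the original question.

```r
# Toy data mirroring the question's layout (values are illustrative).
myDataSet <- data.frame(
  sentence1 = c("Thank you", "Hello", "Goodbye", "Many thanks"),
  sentence2 = c("Thanks", "Hi", "Bye", "Thanks to you"),
  stringsAsFactors = FALSE
)

# One shared dictionary for both text columns, so a given string
# always maps to the same integer wherever it appears.
dictionary <- sort(unique(c(myDataSet$sentence1, myDataSet$sentence2)))

# encode(): text -> integer code (NA for strings not in the dictionary).
encode <- function(x) match(x, dictionary)

# decode(): integer code -> original text.
decode <- function(code) dictionary[code]

codes1 <- encode(myDataSet$sentence1)
codes2 <- encode(myDataSet$sentence2)
```

Keeping dictionary around makes it possible to encode new rows with the same codes later, and to translate codes back to text when inspecting results.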

# using the car package
library(car)

# Recode "Hello" to 1 (string values must be quoted inside the recode spec)
myDataSet$sentence1 <- car::recode(myDataSet$sentence1, "'Hello'=1")


# using dplyr recode (it works on a vector, not on a whole data frame)

dplyr::recode(myDataSet$sentence1, Hello = 1)


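# another dplyr way, for all factor columns (beware the data types:
# character columns are skipped unless they are converted to factors first)
# returns the data frame with the factor columns overwritten by integer codes
# (funs() is deprecated, so a formula lambda is used instead)

dplyr::mutate_if(myDataSet, is.factor, ~ as.numeric(interaction(.x, drop = TRUE)))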


# preserve the original columns (useful for keeping a dictionary for later use);
# the recoded copies get the suffix _new
dplyr::mutate_if(myDataSet, is.factor, list(new = ~ as.numeric(interaction(.x, drop = TRUE))))


# using match (values missing from oldvalues become NA)
oldvalues <- c("Hello", "Hi")
newvalues <- c(1, 2)  # numeric codes, since the SVM needs numeric input
myDataSet$sentence1 <- newvalues[match(myDataSet$sentence1, oldvalues)]
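
Putting the pieces together, a minimal end-to-end sketch could look like the following. It assumes the e1071 package is installed, rebuilds the four sample rows from the question as toy data, and uses the match approach with a shared set of levels (text_levels is a name chosen here for illustration).

```r
library(e1071)  # provides svm(); assumed to be installed

# The four sample rows shown in the question.
myDataSet <- data.frame(
  sentence1 = c("Thank you", "Hello", "Goodbye", "Many thanks"),
  sentence2 = c("Thanks", "Hi", "Bye", "Thanks to you"),
  lengthofsentence1 = c(9, 5, 7, 11),
  lengthOfsentence2 = c(6, 2, 3, 13),
  label = c(1, 2, 3, 1),
  stringsAsFactors = FALSE
)

# Recode both text columns to integer codes via one shared set of levels.
text_levels <- unique(c(myDataSet$sentence1, myDataSet$sentence2))
train_data <- data.frame(
  sentence1 = match(myDataSet$sentence1, text_levels),
  sentence2 = match(myDataSet$sentence2, text_levels),
  lengthofsentence1 = myDataSet$lengthofsentence1,
  lengthOfsentence2 = myDataSet$lengthOfsentence2
)
labels <- factor(myDataSet$label)

# All four columns are numeric now, so they can be passed as a matrix.
model <- svm(as.matrix(train_data), labels, type = "C-classification")

# Predictions on the training rows, just to show the call.
fitted_labels <- predict(model, as.matrix(train_data))
```

Note that integer codes impose an arbitrary order and distance on the strings, which an SVM will treat as meaningful. For real text, features derived from the text itself (such as the document-term matrix used in the tutorial mentioned in the question) combined with the numeric columns may work better; this sketch only shows the mechanics of mixing recoded text and numbers.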
