Naive Bayes classifier error on the Kaggle Titanic dataset in R

Time: 2017-12-28 19:31:32

Tags: r machine-learning naivebayes kaggle

I am trying to train a naive Bayes classifier for the Kaggle Titanic dataset (URL: https://www.kaggle.com/c/titanic/data for "train.csv" and "test.csv").

The code I have come up with so far is as follows:

library(e1071)

train_d <- read.csv("train.csv", stringsAsFactors = TRUE)

# columns chosen for training data-
# colnames(TD)  OR names(TD)
# "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch","Embarked"
train_data <- train_d[, c(2:3, 5:8, 12)]

# to find out which columns contain NA (missing values)-
colnames(train_data)[apply(is.na(train_data), 2, any)]

# mean(TD$age, na.rm = TRUE)    # to find mean of 'age' which contains 'NA'
# which(is.na(age))

# fill in missing value (NA) with mean of 'Age' column-
train_data$Age[which(is.na(train_data$Age))] <- mean(train_data$Age, na.rm = TRUE)

# check whether there are any existing NAs-
which(is.na(train_data$Age))
# OR-
colnames(train_data)[apply(is.na(train_data), 2, any)]


test_d <- read.csv("test.csv", stringsAsFactors = TRUE)

# columns chosen for test data-
# "Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"
test_data <- test_d[, c(2, 4:7, 11)]

# find out missing values (NA)-
colnames(test_data)[apply(is.na(test_data), 2, any)]

# fill in missing value (NA) with mean of 'Age' column-
test_data$Age[which(is.na(test_data$Age))] <- mean(test_data$Age, na.rm = TRUE)

# check whether there are any existing NAs-
which(is.na(test_data$Age))
# OR-
colnames(test_data)[apply(is.na(test_data), 2, any)]




# training a naive-bayes classifier-
titanic_nb <- naiveBayes(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked, data = train_data)


# predict using trained naive-bayes classifier-
output <- predict(titanic_nb, test_data, type = "class")

However, 'output' does not actually contain anything. Printing the 'output' variable gives:

> output
factor(0)
Levels: 

What is going wrong?

Thanks!

1 Answer:

Answer 0: (score: 0)

Here is the answer: the original question was deleted, so this is a link to a web cache.

The reason is that the model does not really know how to handle character columns, as you can see if you run data.matrix(test_data).
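A quick way to find which columns are stored as character is to inspect each column's class. The sketch below uses a small hypothetical data frame (toy_test, with made-up values standing in for test.csv) and base R only. What data.matrix() does with character columns can differ between R versions, so its output is not shown here.

```r
# Hypothetical toy rows standing in for test.csv (values are made up).
toy_test <- data.frame(
  Pclass   = c(3L, 1L, 2L),
  Sex      = c("male", "female", "female"),
  Age      = c(34.5, 47, NA),
  Embarked = c("Q", "S", "C"),
  stringsAsFactors = FALSE  # keep strings as character columns
)

# Class of every column: character columns are the ones to convert.
col_classes <- sapply(toy_test, class)
print(col_classes)

# Names of the columns stored as character.
char_cols <- names(col_classes)[col_classes == "character"]
print(char_cols)
```

Running the same sapply(..., class) call on the real train_data and test_data shows which columns, if any, still need converting.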

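The solution is to first convert your character columns to factors, making sure the factor levels are consistent between train and test.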
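One way to do this is sketched below in base R, assuming two small hypothetical data frames (toy_train and toy_test, with made-up values) in place of the real files. Each categorical column is converted using the union of values seen in both sets as the common level set. Separately, and as an assumption not stated in this answer, the response column is also converted to a factor, since a classifier generally needs class labels to return.

```r
# Hypothetical toy data standing in for train.csv / test.csv (values are made up).
toy_train <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Sex      = c("male", "female", "female", "male"),
  Embarked = c("S", "C", "Q", "S"),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(
  Sex      = c("female", "male"),
  Embarked = c("Q", "S"),
  stringsAsFactors = FALSE
)

# Convert each shared character column to a factor, using the union of
# values seen in both sets as the common level set.
for (col in c("Sex", "Embarked")) {
  shared_levels <- sort(unique(c(toy_train[[col]], toy_test[[col]])))
  toy_train[[col]] <- factor(toy_train[[col]], levels = shared_levels)
  toy_test[[col]]  <- factor(toy_test[[col]],  levels = shared_levels)
}

# Assumption beyond the answer: make the response a factor as well,
# so the model has class labels to predict.
toy_train$Survived <- factor(toy_train$Survived)

# Both data frames now agree on the levels of every categorical column.
print(levels(toy_train$Embarked))
print(identical(levels(toy_train$Embarked), levels(toy_test$Embarked)))
```

Passing an explicit levels argument to factor() is what keeps the two sets aligned even when a value appears in only one of them.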

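As a side note, I would suggest starting with a random forest, since it usually performs well without any parameter tuning and does not care about the distribution of the variables (unlike NB, which assumes a Gaussian distribution for numeric predictors).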
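To make the Gaussian remark concrete: for a numeric predictor such as Age, a Gaussian naive Bayes model summarises each class by a mean and a standard deviation, then scores a new value with the normal density. The following is a minimal base-R sketch on made-up numbers (toy_age and toy_survived are hypothetical), meant to illustrate the idea rather than reproduce e1071's internals.

```r
# Made-up ages and outcomes, for illustration only.
toy_age      <- c(22, 38, 26, 35, 54, 2, 27, 14)
toy_survived <- factor(c(0, 1, 1, 1, 0, 0, 1, 1))

# Per-class mean and standard deviation of the numeric predictor.
class_mean <- tapply(toy_age, toy_survived, mean)
class_sd   <- tapply(toy_age, toy_survived, sd)

# Class priors estimated from the class frequencies.
prior <- as.numeric(table(toy_survived)) / length(toy_survived)

# Score a new passenger aged 30: prior times Gaussian likelihood per class,
# normalised so the two posteriors sum to 1.
new_age    <- 30
likelihood <- as.numeric(dnorm(new_age, mean = class_mean, sd = class_sd))
posterior  <- prior * likelihood / sum(prior * likelihood)
names(posterior) <- levels(toy_survived)
print(round(posterior, 3))
```

A strongly skewed or multi-modal variable is summarised poorly by a single mean and standard deviation, which is the sense in which a tree-based model is less sensitive to the distribution of its inputs.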