Question

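I have read in this dataset and want to combine the training and test data (I should mention that this is part of a course exercise).

I have read both datasets and assigned all the column names. The training data has 7352 rows and 562 columns, and the test set has 2947 rows and 562 columns. The column names are the same in both datasets.

When I try to combine the data with bind_rows, I get a dataset with 10299 rows but only 478 columns instead of 562.

When I use rbind I get the correct result, but then I have to convert it again with tbl_df, so I would prefer to use bind_rows.

Below is the script I wrote. Run it from the folder that contains the unzipped data described above (e.g. the folder "UCI HAR Dataset") to reproduce the problem.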

## Setting the script folder to be the current directory
## (sys.frame(1)$ofile is only set when the script is run via source())
CurrentScriptDirectory <- dirname(sys.frame(1)$ofile)
setwd(CurrentScriptDirectory)

library(dplyr)

# Reading the data
train_x <- tbl_df(read.table("./UCI HAR Dataset/train/X_train.txt"))
train_y <- tbl_df(read.table("./UCI HAR Dataset/train/y_train.txt"))
test_x <- tbl_df(read.table("./UCI HAR Dataset/test/X_test.txt"))
test_y <- tbl_df(read.table("./UCI HAR Dataset/test/y_test.txt"))

#Giving the y's proper names
colnames(train_y) <- c("Activity Name")
colnames(test_y) <- c("Activity Name")

# Reading the feature names
featureNames <- read.table("./UCI HAR Dataset/features.txt")
featureNames <- featureNames[, 2]

# Giving the training and test data proper names
colnames(train_x) <- featureNames
colnames(test_x) <- featureNames

labeledTrainingSet <- bind_cols(train_x,train_y)
labeledTestSet <- bind_cols(test_x,test_y)

labeledDataSet <- bind_rows(labeledTrainingSet, labeledTestSet)

Can anyone help me understand what I am doing wrong?

Answer 1

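I have worked with this dataset and ran into the same problem. As others have mentioned, some of the features are duplicated.

Rename the duplicated columns and make the names syntactically valid. You can use: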

make.names(X, unique = TRUE, allow_ = TRUE)

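where X is a character vector. The function only adds to the existing column names, so you do not lose the original terms. For details, see http://www.inside-r.org/r-doc/base/make.names

Once all column names are unique, dplyr::bind_rows() will work!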
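
As a minimal sketch of this approach, the snippet below applies make.names() to a small made-up vector of feature-style names (hypothetical values, not read from the actual features.txt) and checks that the result is free of duplicates. The toy data frames and the variable names are assumptions for illustration only, and the dplyr call is guarded because it assumes the package is installed.

```r
# Hypothetical feature names, made up for illustration;
# the real ones would come from features.txt
feature_names <- c("tBodyAcc-mean()-X",
                   "fBodyAcc-bandsEnergy()-1,8",
                   "fBodyAcc-bandsEnergy()-1,8")

# unique = TRUE appends a numeric suffix to repeated names;
# characters that are not valid in R names are replaced with "."
clean_names <- make.names(feature_names, unique = TRUE, allow_ = TRUE)
print(clean_names)

# No duplicates remain after cleaning
stopifnot(anyDuplicated(clean_names) == 0)

# Two toy data frames (2 rows, 3 columns each) sharing the cleaned names
train <- as.data.frame(matrix(1:6, nrow = 2))
test  <- as.data.frame(matrix(7:12, nrow = 2))
names(train) <- clean_names
names(test)  <- clean_names

# With unique names, bind_rows() can match every column
if (requireNamespace("dplyr", quietly = TRUE)) {
  combined <- dplyr::bind_rows(train, test)
  print(dim(combined))
}
```

In the question's script, the same call could be applied to the feature-name vector before it is assigned with colnames(), so that both the training and the test set receive identical, unique names.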

Answer 2

Just checked this. You have duplicate names in your featureNames set, and bind_rows drops them.

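library(dplyr)

# Two data frames in which the name "B" is used for two columns
test1 <- data.frame(c(1, 2, 3), c(1, NA, 3), c(1, 2, NA))
names(test1) <- c("A", "B", "B")

test2 <- data.frame(c(1, 2, 3), c(1, NA, 3), c(1, 2, NA))
names(test2) <- c("A", "B", "B")

# Binding them reproduces the problem from the question
test3 <- bind_rows(test1, test2)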
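
One way to confirm this diagnosis before binding is to look for repeated names directly. The sketch below uses only base R and the toy column names from this answer; with the real data, the same checks would presumably be run on the vector read from features.txt (an assumption, since that file is not reproduced here). The variable names are illustrative.

```r
# Column names from the toy example: "B" appears twice
col_names <- c("A", "B", "B")

# anyDuplicated() returns the index of the first repeat, or 0 if there is none
first_repeat <- anyDuplicated(col_names)
print(first_repeat)

# The names that occur more than once
dup_names <- unique(col_names[duplicated(col_names)])
print(dup_names)

# Number of columns that share a name with an earlier column
n_repeats <- sum(duplicated(col_names))
print(n_repeats)
```

In the question, 562 columns went in and 478 came out, a difference of 84. That would be consistent with the last count coming to 84 on the real column names.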

bind_rows binds the rows, but some columns are missing
