Question

In SparkR, I have a DataFrame called data. When I run head(data), I get this output:

  C0      C1               C2         C3
1 id user_id foreign_model_id machine_id 
2  1   3145                4         12 
3  2   4079                1          8 
4  3   1174                7          1    
5  4   2386                9          9    
6  5   5524                1          7

I want to remove C0, C1, C2, C3 because they cause problems later on. For example, when I use the filter function:

filter(data,data$machine_id==1)

the call does not run, for this reason.

I read the data like this:

data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv")

Answer 1

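SparkR put the header into the first row and gave the DataFrame a new header, because the header option defaults to "false". Set the header option with header="true" and you won't have to deal with this problem.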

data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv", header="true")
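The same effect can be reproduced in base R, which may help to see what is going on. This is only an analogy, not SparkR code: it uses read.csv on an inline string (the csv_text variable is made up for illustration), where header = FALSE pushes the header line into the first data row and generates placeholder column names.

```r
# Inline CSV standing in for the file from the question (illustrative data only)
csv_text <- "id,user_id,foreign_model_id,machine_id
1,3145,4,12
2,4079,1,8"

# header = FALSE: the header line is read as the first data row,
# and base R generates the column names V1..V4
no_header <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# header = TRUE: the header line becomes the column names
with_header <- read.csv(text = csv_text, header = TRUE)

print(names(no_header))
print(names(with_header))
```

With header = FALSE the first row holds the strings id, user_id, and so on, which mirrors the C0..C3 output in the question.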

Answer 2

Try

colnames(data) <- unlist(data[1,])
data <- data[-1,]
> data
#  id user_id foreign_model_id machine_id
#2  1    3145                4         12
#3  2    4079                1          8
#4  3    1174                7          1
#5  4    2386                9          9
#6  5    5524                1          7

If you like, you can add rownames(data) <- NULL to reset the row numbers after removing the first row.

After doing this, you can select the rows that match some condition, e.g.

subset(data, data$machine_id==1)
#  id user_id foreign_model_id machine_id
#4  3    1174                7          1

In base R, the function filter() suggested in the OP is part of the stats namespace and is usually used for time series analysis.
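To illustrate the difference, here is a minimal base R sketch with made-up values (x and df are not from the question). stats::filter applies a linear filter to a series, while row selection on a data frame goes through subset(). Note that loading SparkR masks stats::filter with its own filter(), which is the one the question is about.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter applies a linear filter to a series;
# here it is a centered 3-point moving average
moving_avg <- stats::filter(x, rep(1/3, 3))

# Selecting rows of a base R data frame is done with subset()
df <- data.frame(id = 1:3, machine_id = c(12, 8, 1))
matches <- subset(df, machine_id == 1)

print(moving_avg)
print(matches)
```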

Data

data <- structure(list(C0 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("1", "2", "3", "4", "5", "id"), class = "factor"), C1 = structure(c(6L, 3L, 4L, 1L, 2L, 5L), .Label = c("1174", "2386", "3145", "4079", "5524", "user_id"), class = "factor"), C2 = structure(c(5L, 2L, 1L, 3L, 4L, 1L), .Label = c("1", "4", "7", "9", "foreign_model_id"), class = "factor"), C3 = structure(c(6L, 2L, 4L, 1L, 5L, 3L), .Label = c("1", "12", "7", "8", "9", "machine_id"), class = "factor")), .Names = c("C0", "C1", "C2", "C3"), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6"))

Answer 3

Try this

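# Collect the values of the first row to use as column names
new_names <- c()
for (i in seq_along(names(data))) {
   new_names <- c(new_names, toString(data[1, i]))
}

# Apply the names, then drop the row that held them
names(data) <- new_names
data <- data[-1,]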

Answer 4

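I couldn't use these answers at all, because in SparkR they fail with: object of type 'S4' is not subsettable. I solved the problem this way, although I think there must be a better way to do it.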

data <- withColumnRenamed(data, "C0","id")
data <- withColumnRenamed(data, "C1","user_id")
data <- withColumnRenamed(data, "C2","foreign_model_id")
data <- withColumnRenamed(data, "C3","machine_id")

Now I can use the filter function the way I wanted.
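The four calls above could be folded into a loop. The sketch below is one possible way, not an established SparkR idiom: rename_columns is a hypothetical helper, and its rename_fn argument is an assumption added so the loop can be tried without a Spark session (it defaults to SparkR's withColumnRenamed, which is only looked up when the function is called).

```r
# Hypothetical helper: apply a rename function once per (old, new) pair.
# rename_fn defaults to SparkR's withColumnRenamed(df, existingCol, newCol).
rename_columns <- function(df, old_names, new_names,
                           rename_fn = withColumnRenamed) {
  stopifnot(length(old_names) == length(new_names))
  for (i in seq_along(old_names)) {
    df <- rename_fn(df, old_names[i], new_names[i])
  }
  df
}

# Intended use with the SparkR DataFrame from the question (needs SparkR loaded):
# data <- rename_columns(data,
#                        c("C0", "C1", "C2", "C3"),
#                        c("id", "user_id", "foreign_model_id", "machine_id"))
```

Renaming this way leaves the original header line in the DataFrame as an ordinary row, so reading the file with header="true" as in Answer 1 is likely the cleaner fix.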

Remove column names in a DataFrame
