Grouping instances based on NA values in R

Time: 2015-06-15 14:04:58

Tags: r file csv instance na

I am reading a csv file, and unfortunately my data frame has a lot of missing values. A small snippet is shown below:

dataframe

df <- data.frame(Size= c(800, 850, 1100, 1200, 1000), 
                 Value= c(900, NA, 1300, 1100, NA),
                 Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2,3,3,1,2),
                 Rent= c('y', 'y', 'n', 'y', 'n'))

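I want to predict some results using Weka, but I cannot do that when several attributes are missing. I know I should use the function is.na, but I am not sure how to go about it, since so far I have only used it for sums and counts.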

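Edit: For example, in this file 4 of the 5 instances have missing values. Instances 2 and 5 share the same missing attributes (B and D), while instances 1 and 4 share the same missing value (C). What I want is a data frame made up of each such set of instances, so that I can export them to files and analyze those files separately. An example of the output could be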

> A
[image of example data frame A not preserved]

> B
[image of example data frame B not preserved]

Edit 2:

I want to save the splits; so far I have tried this:

write.csv(split(temp, index), file = "C:/Users/Nikita/Desktop/splits.csv", row.names=FALSE)

But it writes all of the splits into a single row. Is there a way to keep them separate?

Edit 3:

My steps are:

data <- read.csv("location")
index <- apply(is.na(data)*1, 1,paste, collapse = "")
s <- split(data, index)
lapply(s, function(x) {names(x) <- names(data);x})
big.data <- do.call(rbind, s)
write.csv(big.data, file = "location", row.names=FALSE)

Am I missing something?

2 answers:

Answer 0: (score: 1)

df[!is.na(df$Value), ]
  Size Value Location Num1 Num2 Rent
1  800   900     <NA>    2    2    y
3 1100  1300   uptown    3    3    n
4 1200  1100     <NA>    2    1    y

And

df[is.na(df$Value), ]
  Size Value Location Num1 Num2 Rent
2  850    NA  midcity   NA    3    y
5 1000    NA Lakeview   NA    2    n
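As a minimal sketch, the same two groups can also be produced in one call by passing the logical vector to split(). The names by_value, value_present and value_missing are chosen here for illustration and are not part of the original answer.

```r
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, "midcity", "uptown", NA, "Lakeview"),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c("y", "y", "n", "y", "n"))

# split() turns the logical vector into a factor, so the resulting
# list has one element named "FALSE" and one named "TRUE"
by_value <- split(df, is.na(df$Value))

value_present <- by_value[["FALSE"]]  # rows where Value is not NA
value_missing <- by_value[["TRUE"]]   # rows where Value is NA
```

This only separates rows by one column; it does not distinguish rows that are missing different combinations of columns.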

In the future, please create a reproducible example so that users do not have to create a data frame by hand from your question. Pictures are not as helpful.

Data

df <- data.frame(Size= c(800, 850, 1100, 1200, 1000), 
                 Value= c(900, NA, 1300, 1100, NA),
                 Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2,3,3,1,2),
                 Rent= c('y', 'y', 'n', 'y', 'n'))

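split() returns a list, so write each element to its own file; passing one fixed file name to every write.csv call would overwrite the same file each time. With lapply over the names of the list:

s <- split(temp, index)
invisible(lapply(names(s), function(n) write.csv(s[[n]], file = paste0("C:/Users/Nikita/Desktop/splits_", n, ".csv"), row.names=FALSE)))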

With a for loop:

s <- split(temp, index)
for (i in seq_along(s)) {
  # s[[i]] extracts the data frame itself; s[i] would be a one-element list
  write.csv(s[[i]], file = paste0("C:/Users/Nikita/Desktop/", i, "splits.csv"), row.names=FALSE)
}
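A self-contained sketch of the whole round trip may help here: build the NA-pattern index, write one csv per pattern, and read one file back to check it. It assumes the example data from the question; tempdir() and the "splits_" file name prefix are placeholders picked for this sketch, so substitute your own folder and naming.

```r
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, "midcity", "uptown", NA, "Lakeview"),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c("y", "y", "n", "y", "n"))

# One string per row: 1 where the column is NA, 0 otherwise
index <- apply(is.na(df) * 1, 1, paste, collapse = "")
s <- split(df, index)

# Assumption: tempdir() stands in for the real output folder
out_dir <- tempdir()
paths <- file.path(out_dir, paste0("splits_", names(s), ".csv"))

# Write each group to its own file, pairing list elements with paths
invisible(Map(function(part, path) write.csv(part, file = path, row.names = FALSE),
              s, paths))

# Read back the group where Value and Num1 are missing
check <- read.csv(paths[names(s) == "010100"])
```

Because each group goes to a separate file, there is no need to rbind the pieces back together before writing.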

Answer 1: (score: 1)

Recreating your example data:

df <- data.frame(Size= c(800, 850, 1100, 1200, 1000), 
                 Value= c(900, NA, 1300, 1100, NA),
                 Location= c(NA, 'midcity', 'uptown', NA, 'Lakeview'),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2,3,3,1,2),
                 Rent= c('y', 'y', 'n', 'y', 'n'))

Now, splitting your data according to the pattern of NA as you want:

# This generates an index with 1 for a column with NA and 0 otherwise
index <- apply(is.na(df)*1, 1,paste, collapse = "")

# This splits the data.frame according to the index
split(df, index)
$`000000`
  Size Value Location Num1 Num2 Rent
3 1100  1300   uptown    3    3    n

$`001000`
  Size Value Location Num1 Num2 Rent
1  800   900     <NA>    2    2    y
4 1200  1100     <NA>    2    1    y

$`010100`
  Size Value Location Num1 Num2 Rent
2  850    NA  midcity   NA    3    y
5 1000    NA Lakeview   NA    2    n

Notice that the first element "000000" contains all the complete cases. Then "001000" contains all observations where column 3 (Location) is missing, and "010100" contains those where columns 2 and 4 (Value and Num1) are missing.
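The 0/1 strings can be hard to read with many columns. As a hedged sketch, the same grouping can be labelled with the names of the missing columns instead, and table() gives the number of rows per pattern. The names na_mask, pattern_counts, missing_cols and groups, the "+" separator, and the "none" label are choices made for this illustration.

```r
df <- data.frame(Size = c(800, 850, 1100, 1200, 1000),
                 Value = c(900, NA, 1300, 1100, NA),
                 Location = c(NA, "midcity", "uptown", NA, "Lakeview"),
                 Num1 = c(2, NA, 3, 2, NA),
                 Num2 = c(2, 3, 3, 1, 2),
                 Rent = c("y", "y", "n", "y", "n"))

# Logical matrix with TRUE wherever a value is missing
na_mask <- is.na(df)

# Same 0/1 index as above, plus a count of rows per pattern
index <- apply(na_mask * 1, 1, paste, collapse = "")
pattern_counts <- table(index)

# Label each row with the names of its missing columns
missing_cols <- apply(na_mask, 1, function(row) paste(names(df)[row], collapse = "+"))
missing_cols[missing_cols == ""] <- "none"  # complete cases

# Split on the readable labels instead of the 0/1 strings
groups <- split(df, missing_cols)
```

The groups hold the same rows as split(df, index); only the element names differ.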