Based on R

Time: 2017-10-13 22:55:12

Tags: r missing-data

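OK, so I have a bit of a dilemma, and I know there has to be a solution. I have a data table with 13 columns, but we only care about two of them (Fare and pClass). There are 1309 rows, and 1308 of them have a Fare value. I want to fill in the one missing value using the mean Fare of the matching class (pClass). So what I want is to tell R to find the row where Fare is NA, read the value in pClass (1, 2, or 3), compute the mean Fare for that class, and replace the missing Fare with that mean.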

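So, to sum up the task for whoever is brave or kind enough to help me: I want to find the missing value, figure out which class it belongs to, take the mean of that class, and replace the missing value with that mean.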

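Doing it this way, rather than just finding the missing row and reading it off by hand, seems like the better route: when I have several missing values in R, I can replace each one with the correct mean regardless of the deciding column.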

Thanks for your time,

-Dylan

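OK, since this was apparently too specific to get an answer to the original question, on to a new plan, boys (and girls, provided you know what you're talking about and what I'm trying to do). So! The new plan is to make 3 variables corresponding to the three different pClasses (1, 2 and 3), each holding the mean Fare of its pClass (call 'em pClassAVG.(x), where x = 1, 2 or 3). Then have R find the missing values of Fare and replace them with the variable (mean) of the corresponding pClass. R's thought process should look like this: "OK, missing value. What's the pClass? OK, it's 2, so we should replace the missing value with pClassAVG.2."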

I got a -1 last time for not including my code, so here it is:

setwd("C:/Users/Maker/Desktop/Data Science/Data/Dylan T/Titanic data")
titanic.train <- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test <- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
# line 1 tells R where to look for the data. lines 2 & 3 read in the files we want to manipulate
# stringsAsFactors = FALSE keeps text columns as character strings instead of factors, bc we are gonna clean the two data sheets together
# header = TRUE makes R understand that there are headers, so it does not read the
# first line as data but as column names
#currently reads incorrectly

titanic.train$IsTrainSet <- TRUE
titanic.test$IsTrainSet <- FALSE
#makes a new column to tell us whether a row comes from the train set or the test set

titanic.test$Survived <- NA
#makes a new column and fills it with NA so that the columns line up and have the same names

titanic.full <- rbind(titanic.train, titanic.test)
titanic.full[titanic.full$Embarked=='', "Embarked"] <- 'S'
#ended day 1 at 12 minutes

age.median <- median(titanic.full$Age, na.rm = TRUE)
#creates a variable called age.median and assigns it the median of the Age column, excluding the missing values (if we included missing
#values, median() would return NA)
#this method is better for replacing data that can change, for example real-time data that changes over the course of the day and your
#data gets its info updated every so often, thus eliminating the problem of missing values and an incorrect median.

titanic.full[is.na(titanic.full$Age), "Age"] <- age.median
#table(is.na(titanic.full$Age)) counts the missing values in the Age column of titanic.full (TRUE means the value is missing)

pClassAVG.1 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 1 )
pClassAVG.2 <- median(titanic.full$Fare, na.rm = TRUE, titanic.full$Pclass == 2 )

The last two lines are my attempt to make the pClassAVG.1 and pClassAVG.2 mentioned above.

2 answers:

Answer 0: (score: 0)

df <- data.frame(Fare = c(10, 20, 30, 40, 50, 60, NA, 70, 80), pClass = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

a <- df$pClass[which(is.na(df$Fare))] # find the pClass where Fare is missing

df$Fare[which(is.na(df$Fare))] <- mean(df$Fare[df$pClass == a], na.rm = TRUE) # replace the missing Fare with the mean of the corresponding pClass

This only works when a single Fare is missing.
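The same idea can probably be stretched to any number of missing fares by looking up each missing row's own class mean. A minimal sketch in base R, assuming made-up toy data (the values and the `class_means` and `missing` names are only for illustration):

```r
# Toy data with several missing fares (values are illustrative only)
df <- data.frame(Fare   = c(10, 20, 30, NA, 50, NA, NA, 70, 80),
                 pClass = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# Mean Fare per class, ignoring NAs; the result is named by class ("1", "2", "3")
class_means <- tapply(df$Fare, df$pClass, mean, na.rm = TRUE)

# Logical vector marking the rows where Fare is missing
missing <- is.na(df$Fare)

# Look up each missing row's class mean by name and fill it in
df$Fare[missing] <- class_means[as.character(df$pClass[missing])]
```

Because the lookup is done per row, two missing fares in different classes each get their own class mean.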

Answer 1: (score: 0)

This should work... let me know if it doesn't.

There may be a more elegant solution using apply... but this works too.

# Create a data frame named df
fare <- c(6, 8, 3, NA, 5, 1, 0, 7, NA, 4, 1, 8, 6, NA, 2)
pclass <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(fare, pclass)

# Loop over each row
for (i in seq_along(df$fare)) {

  # If the value for fare is missing
  if (is.na(df$fare[i])) {

    # then replace it with the mean of the group defined by pclass
    df$fare[i] <- mean(df$fare[df$pclass == df$pclass[i]], na.rm = TRUE)

  }
}
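
For what it's worth, the loop can likely be avoided with base R's `ave()`, which applies a function within each group and returns a vector aligned with the original rows. A sketch on the same toy data (the `impute_mean` helper is a made-up name for this example, not a built-in function):

```r
# Same toy data as in the loop version
fare   <- c(6, 8, 3, NA, 5, 1, 0, 7, NA, 4, 1, 8, 6, NA, 2)
pclass <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(fare, pclass)

# Hypothetical helper: replace the NAs in a vector with that vector's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# ave() calls impute_mean() once per pclass group and returns the results in row order
df$fare <- ave(df$fare, df$pclass, FUN = impute_mean)
```

This should give the same result as the loop, since each NA is replaced by the mean of the non-missing fares in its own pclass group.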