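I have a CSV file that ships with the scikit-learn library. Before building any predictive model in Python, I want to look at how each attribute correlates with the key attribute. So I imported the CSV file like this:
y <- read.csv("boston_house_prices.csv")
Now I can't seem to run any descriptive statistics, or run cor(y[, 1:13], y[, 14]). It says 'x' is not numeric. I tried: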
y <- as.data.frame(sapply(y, as.numeric))
and
y <- data.matrix(y)
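Now the data is numeric and I can run the correlation. But when I run basic statistics, everything is skewed by the "conversion" that took place. Can someone tell me how to keep the data's native types while still running cor()? Why does R have to convert double/decimal values to integers in order to run it?
Thanks.
Answer 0 (score: 0)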
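You can avoid this problem by passing skip = 1 to read.csv when reading the data. I grabbed a few rows from the original data and it seems to work fine.
The first line of the file is unnecessary: it pushes the header row down into the first row of data, which in turn causes the columns to be read as factors. When you then use as.numeric, you are actually replacing every factor value with its internal level code, which differs from the original number and is therefore wrong. That is the "skew" you describe.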
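As a minimal sketch of that effect (the vector below is made up for illustration and is not read from the file), calling as.numeric on a factor returns the level codes rather than the numbers stored in the labels. Note that from R 4.0 onward read.csv no longer creates factors by default, so the columns would come in as character instead; data.matrix still converts those through factors, so the same distortion can appear.

```r
# Hypothetical column: three numbers plus a stray header string,
# which is what forces the whole column to be non-numeric.
f <- factor(c("0.538", "0.469", "0.458", "NOX"))

# as.numeric() on a factor returns the internal level codes, not the data.
codes <- as.numeric(f)

# Going through character parses the labels; "NOX" cannot be parsed and becomes NA
# (suppressWarnings hides the coercion warning that R would otherwise emit).
values <- suppressWarnings(as.numeric(as.character(f)))

print(codes)   # 3 2 1 4
print(values)  # 0.538 0.469 0.458 NA
```

The codes are just the positions of each label in the sorted set of levels, which is why statistics computed on them have nothing to do with the original measurements.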
txt <- '506,13,,,,,,,,,,,,
"CRIM","ZN","INDUS","CHAS","NOX","RM","AGE","DIS","RAD","TAX","PTRATIO","B","LSTAT","MEDV"
0.00632,18,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24
0.02731,0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4'
Your current call produces factors:
sapply(read.csv(text = txt), class)
# X506 X13 X X.1 X.2 X.3 X.4
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"
# X.5 X.6 X.7 X.8 X.9 X.10 X.11
# "factor" "factor" "factor" "factor" "factor" "factor" "factor"
Adding skip = 1 seems to solve the problem, since it produces numeric columns:
sapply(read.csv(text = txt, skip = 1), class)
# CRIM ZN INDUS CHAS NOX RM AGE
# "numeric" "integer" "numeric" "integer" "numeric" "numeric" "numeric"
# DIS RAD TAX PTRATIO B LSTAT MEDV
# "numeric" "integer" "integer" "numeric" "numeric" "numeric" "numeric"
So if you change your first line to
y <- read.csv("boston_house_prices.csv", skip = 1)
everything should be fine afterwards, with no further conversion needed.
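To check the result end to end, here is a small sketch that uses a trimmed-down sample (three of the columns from the rows shown earlier, with a shortened version of the stray first line; txt_small is a name introduced only for this illustration). It assumes the real file behaves the same way as the sample.

```r
# Trimmed-down sample: a stray first line, then the header, then four data rows.
txt_small <- '506,13,
"RM","LSTAT","MEDV"
6.575,4.98,24
6.421,9.14,21.6
7.185,4.03,34.7
6.998,2.94,33.4'

# skip = 1 drops the stray line so the real header is used for column names.
y <- read.csv(text = txt_small, skip = 1)

str(y)      # every column is read as numeric
summary(y)  # descriptive statistics on the original values

# Correlation of each predictor with the target column MEDV.
r <- cor(y[, c("RM", "LSTAT")], y[, "MEDV"])
print(r)
```

With the full file, the same call shape as in the question, cor(y[, 1:13], y[, 14]), should then work without as.numeric or data.matrix.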