I am trying to load the data from http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data into R with the following code:
hData <- read.table(file.choose(), sep = "\t", dec = ",", fileEncoding = "UTF-16")
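But it does not populate the data correctly. The data has 76 attributes; details are at http://archive.ics.uci.edu/ml/datasets/Heart+Disease.
Can anyone tell me what I am doing wrong?
Answer 0 (score: 3)
The file contains extra line breaks, which cause the problem: each record is wrapped across several lines. If you remove those line breaks with a regular expression, you can read the file: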
# read file into a single string
x <- readr::read_file('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')
# or in base, x <- paste(readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data')), collapse = '\n')
# gsub out line breaks that follow numbers (not "name") and read data
df <- read.table(text = gsub('(\\d)\\n', '\\1 ', x))
head(df, 2)
##     V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
## 1 1254  0 40  1  1  0  0 -9  2 140   0 289  -9  -9  -9   0  -9  -9   0  12  16  84   0   0   0
## 2 1255  0 49  0  1  0  0 -9  3 160   1 180  -9  -9  -9   0  -9  -9   0  11  16  84   0   0   0
##   V26 V27 V28 V29 V30 V31 V32 V33 V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48
## 1   0   0 150  18  -9   7 172  86 200 110 140  86   0   0   0  -9  26  20  -9  -9  -9  -9  -9
## 2   0   0  -9  10   9   7 156 100 220 106 160  90   0   0   1   2  14  13  -9  -9  -9  -9  -9
##   V49 V50 V51 V52 V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67 V68 V69 V70 V71
## 1  -9  -9  -9  -9  -9  -9  12  20  84   0  -9  -9  -9  -9  -9  -9  -9  -9  -9  -9   1   1   1
## 2  -9  -9  -9  -9  -9  -9  11  20  84   1  -9  -9   2  -9  -9  -9  -9  -9  -9  -9   1   1   1
##   V72 V73 V74 V75  V76
## 1   1   1  -9  -9 name
## 2   1   1  -9  -9 name
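To see why the regular expression works, here is a minimal sketch on a made-up string rather than the real file. The `toy` data and the three-lines-per-record layout are assumptions for illustration only; the point is that only line breaks directly after a digit are replaced, so the line break after "name" still ends each record.

```r
# Toy stand-in for the real file (hypothetical data): two records, each
# wrapped over three physical lines, with the last line ending in "name".
toy <- "1 2 3\n4 5 6\n7 8 name\n9 10 11\n12 13 14\n15 16 name\n"

# Replace a newline with a space only when it directly follows a digit,
# so the newline after "name" survives as the record separator.
joined <- gsub("(\\d)\\n", "\\1 ", toy)

# Each record is now on a single line and parses as one row.
toy_df <- read.table(text = joined, stringsAsFactors = FALSE)
dim(toy_df)
```

The same idea only carries over to the real file because every record there ends in a non-digit field; a file whose records end in a number would need a different way to find the record boundaries.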
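Alternatively, use scan to read all fields into a single character vector (character, because the last field is not numeric), then split it and reassemble: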
# download data and split into a character vector
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'), character())
# split and assemble data.frame
df <- data.frame(split(x, 1:76), stringsAsFactors = FALSE)
# fix types
df[] <- lapply(df, type.convert, as.is = TRUE)
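A variation on the same idea, sketched on made-up data: instead of split, the scanned vector can be reshaped with matrix(..., byrow = TRUE), which makes the "one record per row" intent explicit. The `toy` string and `n_fields` are hypothetical names used only here; for the real file the field count would be 76.

```r
# Toy stand-in (hypothetical data): 2 records of 4 fields, wrapped across lines.
toy <- "1 2\n3 name\n4 5\n6 name\n"
n_fields <- 4  # assumption for the toy data; 76 for the real file

# scan() splits on any whitespace, so line breaks inside a record vanish.
x <- scan(text = toy, what = character(), quiet = TRUE)

# Guard: a length that is not a multiple of n_fields means a record is incomplete.
stopifnot(length(x) %% n_fields == 0)

# Fill row by row, so each record becomes one row of the data frame.
toy_df <- as.data.frame(matrix(x, ncol = n_fields, byrow = TRUE),
                        stringsAsFactors = FALSE)

# Everything is character at this point; convert each column to its natural type.
toy_df[] <- lapply(toy_df, type.convert, as.is = TRUE)
str(toy_df)
```

Both routes depend on every record having exactly the same number of whitespace-separated fields, so the length check is worth keeping.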
Or pass scan a template of the types a single record should contain, so that it reads directly into a list:
x <- scan(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'),
c(replicate(75, numeric()), list(character())))
df <- as.data.frame(x)
names(df) <- paste0('V', 1:76) # replace ugly names
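If getting the type structure right is too complicated, use replicate(76, character()) to read everything in as character, then apply type.convert as in the previous option.
Alternatively, use readLines, split the lines into groups so that each group holds the strings for one record, and paste each group back together so the result can be passed to read.table: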
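The template approach can be tried on a small made-up input first. In this sketch the `toy` string, the four-field layout, and the `V1`..`V4` names are assumptions for illustration. It builds the template with rep(list(...)) rather than replicate, and names the template elements so the scanned list already carries usable column names.

```r
# Toy stand-in (hypothetical data): 2 records of 4 fields, wrapped across lines.
toy <- "1 2\n3 name\n4 5\n6 name\n"

# One template element per field: three numeric fields, then one character.
# rep(list(numeric()), 3) builds the list explicitly instead of relying on
# how replicate() simplifies zero-length results.
template <- c(rep(list(numeric()), 3), list(character()))
names(template) <- paste0("V", 1:4)  # hypothetical names; scan() keeps them

# With a list template, scan() lets a record continue onto following lines
# (multi.line = TRUE is the default).
x <- scan(text = toy, what = template, quiet = TRUE)
toy_df <- as.data.frame(x, stringsAsFactors = FALSE)
str(toy_df)
```

Naming the template up front avoids having to overwrite the column names afterwards.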
x <- readLines(url('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/hungarian.data'))
df <- read.table(text = paste(sapply(split(x,
                                           rep(seq(length(x) / 10), each = 10)),
                                     paste, collapse = ' '), collapse = '\n'))
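The grouping step is easier to follow when unrolled. Below is a minimal sketch on made-up lines; `lines_per_record` is a hypothetical name, set to 3 for the toy data, whereas the code above uses 10 for the real file.

```r
# Toy stand-in for readLines() output (hypothetical data):
# 2 records, 3 physical lines each.
x <- c("1 2 3", "4 5 6", "7 8 name",
       "9 10 11", "12 13 14", "15 16 name")
lines_per_record <- 3  # assumption for the toy data; 10 for the real file

# Guard: the grouping below only makes sense for complete records.
stopifnot(length(x) %% lines_per_record == 0)

# Group index per physical line: 1 1 1 2 2 2
grp <- rep(seq_len(length(x) / lines_per_record), each = lines_per_record)

# Paste each group's lines into one string, one string per record.
records <- vapply(split(x, grp), paste, character(1), collapse = " ")

# read.table() treats each element of a character vector as one line of input.
toy_df <- read.table(text = records, stringsAsFactors = FALSE)
dim(toy_df)
```

This variant relies on every record spanning the same number of physical lines, which is a stronger assumption than the regex approach makes.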