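I'm new to R and I'd like to read the cereal data from http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
Can someone help me? I started with the following code, but in the end I don't get any data back in the format I want, which is CSV.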
get_data <- readLines("http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html")
get_data[
grep("name","mfr","type","calories","protein","fat",
"sodium","fiber","carbo","sugars","potass","vitamins",
"shelf","weight","cups","rating", get_data )
]
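The result I get from the code above is "character(0)".
Answer 0: (score: 1)
This seems to work, but you need to look at the raw HTML to identify a tag that can be used to find the relevant lines: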
get_data <- readLines("http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html")
pre.lines <- grep("PRE",get_data)
cereals <- read.table(text=get_data[(pre.lines[1]+1):(pre.lines[2]-1)],
header=TRUE)
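To see the same idea without a network connection, here is a minimal sketch. The `page_lines` vector is a made-up stand-in for what `readLines()` might return, and it assumes, as the code above does, that the opening and closing PRE tags each sit on their own line; the real page may be laid out differently.

```r
# Hypothetical stand-in for readLines(URL): a few lines shaped like the page.
page_lines <- c(
  "<HTML><BODY>",
  "<PRE>",
  "name mfr calories",
  "100%_Bran N 70",
  "All-Bran K 70",
  "</PRE>",
  "</BODY></HTML>"
)

# Indices of the lines that contain the opening and closing PRE tags.
pre_lines <- grep("PRE", page_lines, fixed = TRUE)

# Keep only the lines strictly between the two tags and parse them as a table.
table_lines <- page_lines[(pre_lines[1] + 1):(pre_lines[2] - 1)]
cereals_demo <- read.table(text = table_lines, header = TRUE)
```

Note that `grep()` takes a single pattern as its first argument and the vector to search as its second, which is why the tag is the pattern here and the lines come after it.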
Answer 1: (score: 1)
I had to fiddle with it a bit, but this does it cleanly. The table in question has newline characters at its beginning and end.
library(XML)
URL <- "http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html"
xml <- xmlValue(getNodeSet(htmlParse(URL), "//pre")[[1]])
rt <- read.table(text = xml, header = TRUE)
str(rt)
# 'data.frame': 77 obs. of 16 variables:
# $ name : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ mfr : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
# $ type : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
# $ calories: int 70 120 70 50 110 110 110 130 90 90 ...
# $ protein : int 4 3 4 4 2 2 2 3 2 3 ...
# $ fat : int 1 5 1 0 2 2 0 2 1 0 ...
# $ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
# $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
# $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
# $ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
# $ potass : int 280 135 320 330 -1 70 30 100 125 190 ...
# $ vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
# $ shelf : int 3 3 3 3 3 1 2 3 1 3 ...
# $ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
# $ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
# $ rating : num 68.4 34 59.4 93.7 34.4 ...
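Since the question asks for CSV, the data frame can then be written out with `write.csv()`. A minimal sketch, assuming a small made-up data frame `rt_demo` in place of the real `rt` and a temporary file path chosen only for illustration:

```r
# Hypothetical two-row stand-in for the parsed data frame `rt`.
rt_demo <- data.frame(
  name = c("100%_Bran", "All-Bran"),
  calories = c(70L, 70L)
)

# Write to a CSV file in the session's temporary directory (path is illustrative).
csv_path <- file.path(tempdir(), "cereals.csv")
write.csv(rt_demo, csv_path, row.names = FALSE)

# Read the file back to confirm that it round-trips.
round_trip <- read.csv(csv_path)
```

Passing `row.names = FALSE` keeps R from adding an extra unnamed column of row numbers to the file.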
Answer 2: (score: 0)
If this is a one-off, you're better off copying the whole table from the browser and then, in R:
my.data <- read.table("clipboard",header=TRUE)