Error reading data from the Cereals.html page into R

Time: 2014-11-06 23:53:54

Tags: r

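I'm new to R and I want to read the cereals data from http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html

Can someone help me? I started with the following code, but in the end I don't get any data back in the format I want: CSV.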

get_data <- readLines("http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html")

get_data[

grep("name","mfr","type","calories","protein","fat",
     "sodium","fiber","carbo","sugars","potass","vitamins", 
      "shelf","weight","cups","rating", get_data )
]

The result I get from the code above is "character(0)".

3 Answers:

Answer 0: (score: 1)

This seems to work, but you need to look at the raw HTML to identify markers that can be used to find the relevant lines:

get_data <- readLines("http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html")
pre.lines <- grep("PRE",get_data)
cereals <- read.table(text=get_data[(pre.lines[1]+1):(pre.lines[2]-1)],
                      header=TRUE)
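
A minimal sketch of the same idea, run on a small in-memory stand-in for the page so it does not depend on the URL being reachable. The sample lines and the three-column layout are made up for illustration; the final write.csv step is added because the question asks for CSV, and csv_path is just a temporary file name.

```r
# Hypothetical stand-in for readLines(URL): a few HTML lines around a <PRE> block.
get_data <- c(
  "<HTML><BODY>",
  "<PRE>",
  "name mfr calories",
  "100%_Bran N 70",
  "All-Bran K 70",
  "</PRE>",
  "</BODY></HTML>"
)

# Both "<PRE>" and "</PRE>" contain "PRE", so grep returns two line indices.
pre.lines <- grep("PRE", get_data, fixed = TRUE)

# Keep only the lines strictly between the two markers and parse them.
cereals <- read.table(text = get_data[(pre.lines[1] + 1):(pre.lines[2] - 1)],
                      header = TRUE)

# Write the parsed table out as CSV (illustrative temporary path).
csv_path <- tempfile(fileext = ".csv")
write.csv(cereals, csv_path, row.names = FALSE)
```

This only works while the page contains exactly one PRE block whose opening and closing tags sit on their own lines, which is why checking the raw HTML first matters.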

Answer 1: (score: 1)

I had to fiddle with it a bit, but this does it cleanly. The relevant table has newline characters at its beginning and end.

library(XML)
URL <- "http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html"
xml <- xmlValue(getNodeSet(htmlParse(URL), "//pre")[[1]])
rt <- read.table(text = xml, header = TRUE)
str(rt)
# 'data.frame':    77 obs. of  16 variables:
# $ name    : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ mfr     : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
# $ type    : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
# $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
# $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
# $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
# $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
# $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
# $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
# $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
# $ potass  : int  280 135 320 330 -1 70 30 100 125 190 ...
# $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
# $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
# $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
# $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
# $ rating  : num  68.4 34 59.4 93.7 34.4 ...
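
To try the same steps without network access, here is a small sketch that parses an inline HTML string instead of the URL. The html string and its three columns are invented for illustration; htmlParse is called with asText = TRUE so it treats the string as content rather than a file name.

```r
library(XML)  # same package as above

# Hypothetical stand-in for the page; the real page holds the full table.
html <- paste0(
  "<html><body><pre>\n",
  "name mfr calories\n",
  "100%_Bran N 70\n",
  "All-Bran K 70\n",
  "</pre></body></html>"
)

# Parse the string and pull the text out of the first <pre> node.
doc <- htmlParse(html, asText = TRUE)
txt <- xmlValue(getNodeSet(doc, "//pre")[[1]])

# read.table skips blank lines by default (blank.lines.skip = TRUE),
# so a leading or trailing newline in txt is harmless.
rt <- read.table(text = txt, header = TRUE)
```

On R 4.0.0 and later, read.table defaults to stringsAsFactors = FALSE, so name, mfr and type come back as chr columns rather than the Factor columns shown above unless stringsAsFactors = TRUE is passed.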

Answer 2: (score: 0)

If it's just a one-off, you're better off copying the whole table from the browser and then, in R:

my.data <- read.table("clipboard",header=TRUE)