Downloading HTML data in R

Time: 2018-11-03 12:34:22

Tags: html r

I'm having trouble downloading an HTML table using RStudio. I'm sharing a picture of the page at the URL that has the data I need (image: table that I want).

I tried to get it into R using the usual commands.

Here url is the URL of the website where the table is located. I don't know why it fails: instead of the expected data I get NULL values :(.

Does anyone know how to download this table?

2 Answers:

Answer 0 (score: 0)

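If you just want to import the table you can see, the fastest way seems to be to select and copy the table, then import it via the clipboard: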

read.delim("clipboard")

This works fine for me. Note that read.table with its default settings does not work, because the "comment" column is empty in most rows.
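To illustrate why the two readers behave differently, here is a minimal sketch that uses an inline string in place of the clipboard so it can be run anywhere. The sample rows and column names are made up for illustration and are not taken from the page in question. Also note that the "clipboard" connection is platform dependent (on macOS, pipe("pbpaste") is the usual substitute).

```r
# Made-up tab-separated text standing in for what the clipboard would hold;
# the "note" cell is empty in two of the three rows.
copied <- "name\tvalue\tnote\nA\t1\t\nB\t2\tcheck\nC\t3\t"

# read.delim() splits on tabs only and uses fill = TRUE, so empty cells are kept
ok <- read.delim(text = copied, stringsAsFactors = FALSE)
print(ok)

# read.table() defaults: any run of whitespace separates fields and
# fill = FALSE, so rows with an empty trailing cell come up one field short
failed <- tryCatch(
  read.table(text = copied),
  error = function(e) conditionMessage(e)
)
print(failed)
```

Passing sep = "\t" and fill = TRUE to read.table should give the same result as read.delim, since those are the defaults read.delim sets.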

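Interestingly, the page you linked offers the data in several formats (including semicolon- or tab-separated values), which are more convenient than copying and pasting HTML.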

Answer 1 (score: 0)

Using clipboard operations is a great way to make an analysis workflow non-reproducible.

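If you look closely at the search results page (rather than the table viewer), you'll see three letters on the right near the header: "FTP". Click it, and it becomes clear that the site supports FTP access to the data with a uniform directory structure: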

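library(httr)
library(tidyverse)

# Download the gzipped catalogue file over FTP and save it to disk
httr::GET(
  url = "ftp://cdsarc.u-strasbg.fr/pub/cats/I/239/h_dm_com.dat.gz",
  write_disk("h_dm_com.dat.gz")
) -> res

# readr decompresses .gz files transparently; the fields are pipe-delimited
read_delim(
  file = "h_dm_com.dat.gz",
  delim = "|",
  col_names = FALSE,
  trim_ws = TRUE
) %>%
  glimpse()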
## Observations: 24,588
## Variables: 37
## $ X1  <chr> "00003-4417", "00003-4417", "00004-4711", "00004-4711", "0000...
## $ X2  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ X3  <chr> "L", "L", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "...
## $ X4  <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "...
## $ X5  <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "...
## $ X6  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "P", ...
## $ X7  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ X8  <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ X9  <int> 11, 11, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 12, 12, 9, 9, 11, 11, 9...
## $ X10 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ X11 <chr> "COMP", "COMP", "COMP", "COMP", "COMP", "COMP", "COMP", "COMP...
## $ X12 <int> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1...
## $ X13 <chr> "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "...
## $ X14 <int> 25, 25, 37, 37, 40, 40, 45, 45, 50, 50, 55, 55, 71, 70, 96, 9...
## $ X15 <dbl> 6.894, 7.551, 10.966, 11.745, 11.007, 11.176, 9.890, 11.954, ...
## $ X16 <dbl> 0.004, 0.007, 0.092, 0.188, 0.017, 0.019, 0.012, 0.075, 0.003...
## $ X17 <dbl> NA, NA, NA, NA, NA, NA, 10.618, NA, 7.256, NA, 8.164, 9.797, ...
## $ X18 <dbl> NA, NA, NA, NA, NA, NA, 0.033, NA, 0.004, NA, 0.011, 0.073, 0...
## $ X19 <dbl> NA, NA, NA, NA, NA, NA, 9.808, NA, 6.579, NA, 7.613, 9.168, 8...
## $ X20 <dbl> NA, NA, NA, NA, NA, NA, 0.026, NA, 0.003, NA, 0.011, 0.064, 0...
## $ X21 <dbl> 0.07936537, 0.07924029, 0.10536643, 0.10532213, 0.12196971, 0...
## $ X22 <dbl> -44.29030, -44.29021, -47.17960, -47.17955, 67.21679, 67.2151...
## $ X23 <dbl> 13.74, 13.74, 3.74, 3.74, -3.40, -3.40, 15.10, 15.10, 16.89, ...
## $ X24 <dbl> 58.36, 69.09, -6.92, -6.92, -2.99, -2.99, -37.20, -37.20, 52....
## $ X25 <dbl> -108.64, -110.11, 7.03, 7.03, -3.18, -3.18, -2.78, -2.78, -20...
## $ X26 <dbl> 0.88, 1.82, 6.49, 18.42, 3.83, 8.46, 1.82, 18.78, 0.52, 12.46...
## $ X27 <dbl> 0.81, 1.69, 7.96, 20.65, 3.95, 8.08, 1.68, 18.00, 0.56, 13.11...
## $ X28 <dbl> 0.98, 0.98, 2.72, 2.72, 4.25, 4.25, 1.92, 1.92, 0.80, 0.80, 0...
## $ X29 <dbl> 0.73, 1.05, 2.23, 2.23, 4.14, 4.14, 1.95, 1.95, 0.56, 0.56, 0...
## $ X30 <dbl> 0.68, 1.05, 2.14, 2.14, 3.75, 3.75, 1.64, 1.64, 0.55, 0.55, 0...
## $ X31 <chr> NA, "A", NA, "A", NA, "A", NA, "A", NA, "A", NA, "A", NA, "A"...
## $ X32 <dbl> NA, 315.80, NA, 332.00, NA, 224.90, NA, 242.50, NA, 324.80, N...
## $ X33 <dbl> NA, 0.463, NA, 0.230, NA, 8.200, NA, 2.830, NA, 1.700, NA, 3....
## $ X34 <dbl> NA, 0.80, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.11, N...
## $ X35 <dbl> NA, -0.009, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.010...
## $ X36 <int> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0...
## $ X37 <int> 111111, 111011, 111111, 111000, 111111, 111000, 111111, 11100...
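
One practical wrinkle: write_disk() refuses to overwrite an existing file by default, so the download step errors if the script is run a second time. Below is a minimal base-R sketch of a re-runnable variant. The helper names fetch_once and read_pipe_delimited are hypothetical, the demo rows are made up, and the real FTP call is left commented out because it assumes the server is still reachable.

```r
# Hypothetical helper: download a file only if it is not already on disk
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Hypothetical helper: read a pipe-delimited file (gzipped or plain);
# read.delim() opens gzip-compressed files transparently
read_pipe_delimited <- function(path) {
  read.delim(
    path,
    sep = "|",
    header = FALSE,
    strip.white = TRUE,        # same role as trim_ws = TRUE in readr
    quote = "",
    na.strings = c("", "NA"),  # blank cells become NA
    stringsAsFactors = FALSE
  )
}

# Demo on a small made-up gzipped file, so no network access is needed
tmp <- tempfile(fileext = ".dat.gz")
con <- gzfile(tmp, "w")
writeLines(c("a-1 | 1|X | 6.5", "a-2 | 2|  | 7.25"), con)
close(con)

sample_df <- read_pipe_delimited(tmp)
str(sample_df)

# With the catalogue file from this answer (requires network access):
# fetch_once("ftp://cdsarc.u-strasbg.fr/pub/cats/I/239/h_dm_com.dat.gz",
#            "h_dm_com.dat.gz")
# dat <- read_pipe_delimited("h_dm_com.dat.gz")
```

With base R the columns come back as V1, V2, ... rather than X1, X2, ..., and the guessed column types may differ slightly from what readr reports.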