如何从HTML正文中抓取文本

时间:2019-03-27 18:22:24

标签: r web-scraping

我从来没有刮过。仅从下面的链接(以标题SRUS43 KMSR 271039开始,以.END结尾)中将主要的大灰色框中的文本抓取是否简单?我的最终目标是基本上从所有文本中得到三个整齐的数据列:五位数字代码,以英寸为单位的值以及盆地高程描述,因此也欢迎使用任何处理文本格式的指针。

https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6

感谢您的帮助。

2 个答案:

答案 0 :(得分:1)

这很容易用rvest抓取:

library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text() 

答案 1 :(得分:1)

阅读文本非常容易(请参阅@DiceBoyT答案)。清理三列的格式会涉及更多的工作。下面可以进行一些清理(尤其是使用正则表达式),但可以完成工作:

library(tidyverse)
library(rvest)

text <- read_html("https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=6") %>% 
  html_node(".notes") %>% 
  html_text() 

df <- tibble(txt = read_lines(text))

df %>%
  mutate(
    row = row_number(),
    with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
    wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>% str_extract("[:digit:]+\\.?[:digit:]"),
    basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>% str_sub(start = 2)
  ) %>% 
  separate(with_code, c("code", "val"), sep = "\\s+") %>% 
  mutate(
    combined_val = case_when(
      !is.na(val) ~ val,
      !is.na(wo_code) ~ wo_code,
      TRUE ~ NA_character_
    ) %>% as.numeric
  ) %>%
  filter(!is.na(combined_val)) %>%
  mutate(
    code = zoo::na.locf(code),
    basin_desc = zoo::na.locf(basin_desc)
  ) %>%
  select(
    code, combined_val, basin_desc
  )
#> # A tibble: 643 x 3
#>    code  combined_val basin_desc               
#>    <chr>        <dbl> <chr>                    
#>  1 ACSC1          0   San Antonio Ck - Sunol   
#>  2 ADLC1          0   Arroyo De La Laguna      
#>  3 ADOC1          0   Santa Ana R - Prado Dam  
#>  4 AHOC1          0   Arroyo Honda nr San Jose 
#>  5 AKYC1         41   SF American nr Kyburz    
#>  6 AKYC1          3.2 SF American nr Kyburz    
#>  7 AKYC1         42.2 SF American nr Kyburz    
#>  8 ALQC1          0   Alamo Canal nr Pleasanton
#>  9 ALRC1          0   Alamitos Ck - Almaden Res
#> 10 ANDC1          0   Coyote Ck - Anderson Res 
#> # ... with 633 more rows

reprex package(v0.2.1)于2019-03-27创建