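How do you scrape items together so you don't lose the index?

Time: 2019-06-19 18:34:50

Tags: r rvest

I'm using rvest for some basic web scraping and I am getting results back, but the data doesn't line up. That is, I get the items, but they don't match the order on the site, so I can't combine the two data elements I'm scraping into a data.frame.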

library(rvest)
library(tidyverse)

base_url<- "https://www.uchealth.com/providers"
loc <- read_html(base_url) %>%
  html_nodes('[class=locations]') %>%
  html_text() 
dept <- read_html(base_url) %>%
  html_nodes('[class=department last]') %>%
  html_text()

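I expected to be able to create a data frame:

Location  Department

Any suggestions? I wondered whether there is an index that would keep these items together, but I didn't see one.

Edit: I also tried this, with no luck. It seems the starting value for the location is off: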

scraping <- function(

base_url = "https://www.uchealth.com/providers"
)
{
loc <- read_html(base_url) %>%
  html_nodes('[class=locations]') %>%
  html_text() 

dept <- read_html(base_url) %>%
  html_nodes('[class=specialties]') %>%
  html_text()

data.frame(
  loc = ifelse(length(loc)==0, NA, loc),
  dept = ifelse(length(dept)==0, NA, loc), 
  stringsAsFactors=F
)

}

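2 Answers:

Answer 0 (score: 2)

The problem you're facing is that not every child node exists in every parent node. The best way to handle this is to collect all the parent nodes in a list/vector, then use the html_node function to extract the desired information from each parent. html_node always returns exactly one result per parent node, even if that result is NA, whereas html_nodes silently drops the parents that have no match.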

library(rvest)

# Read the page just once
base_url <- "https://www.uchealth.com/providers"
page <- read_html(base_url)

# Collect one parent node per provider
providers <- page %>% html_nodes('ul[id=providerlist]') %>% html_children()

# Extract the requested information from each parent node
dept <- providers %>% html_node("[class ^= 'department']") %>% html_text()
location <- providers %>% html_node('[class=locations]') %>% html_text()

providers, dept, and location should all have the same length.
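
To make the difference between html_nodes and html_node concrete, here is a minimal sketch that runs offline. The inline HTML is a made-up stand-in for the real page (the clinic and department names are hypothetical), and the second provider deliberately has no department element.

```r
library(rvest)

# Hypothetical HTML standing in for the real page; provider 2 has no department.
page <- read_html('
  <ul id="providerlist">
    <li><span class="locations">Clinic A</span><span class="department last">Surgery</span></li>
    <li><span class="locations">Clinic B</span></li>
    <li><span class="locations">Clinic C</span><span class="department last">Dermatology</span></li>
  </ul>')

# Page-level html_nodes() skips missing elements, so the vectors differ in length.
loc_all  <- page %>% html_nodes(".locations") %>% html_text()
dept_all <- page %>% html_nodes("[class^='department']") %>% html_text()

# One parent node per provider, then html_node() (singular) on each parent.
providers <- page %>% html_nodes("ul#providerlist") %>% html_children()

result <- data.frame(
  Location   = providers %>% html_node(".locations") %>% html_text(),
  Department = providers %>% html_node("[class^='department']") %>% html_text(),
  stringsAsFactors = FALSE
)
print(result)
```

Here loc_all has three elements and dept_all only two, which is why they cannot be combined directly, while result keeps one row per provider with NA for the missing department.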

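Answer 1 (score: 2)

A more involved option is to first convert all the data available in each .searchresult node into a data frame, then stack them with dplyr::bind_rows. I think this goes beyond what you asked for, but it still gets around your problem and may be useful in the more general case: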

library(rvest)
library(tidyverse)

base_url<- "https://www.uchealth.com/providers"

html <- read_html(base_url)

# Extract `.searchresult` nodes.
res_list <- html %>% 
    html_nodes(".searchresult") %>% 
    unclass()

# Turn each node into a one-row dataframe: split every "Label: value" item
# on the first colon, then use the labels as column names.
df_list <- res_list %>% 
    map(~ {html_nodes(., ".propertylist li") %>% 
            html_text(trim = TRUE) %>% 
            str_split(":", 2) %>%
            map(~ str_trim(.) %>% cbind() %>% as_tibble()) %>%
            bind_cols() %>%
            set_names(.[1,]) %>% 
            .[-1,]
    })

# Stack the dataframes, add the person names, and reorder the columns.
ucdf <- bind_rows(df_list) %>% 
    mutate(Name = map_chr(res_list, ~ html_node(., "h4") %>% html_text(trim = TRUE))) %>% 
    select(Name, 1:(ncol(.)-1))

Which returns:

# A tibble: 1,137 x 5
   Name         Title                       Locations                                      Specialties              Department        
   <chr>        <chr>                       <chr>                                          <chr>                    <chr>             
 1 Adrian Abre… Assistant Professor of Med… UC Health Physicians Office South (West Chest… nephrology               Internal Medicine 
 2 Bassam G. A… Associate Professor of Cli… University of Cincinnati Medical Center: (513… nephrology, organ trans… Internal Medicine 
 3 Brian Adams… Professor, Director of Res… UC Health Physicians Office (Clifton - Piedmo… dermatology              Dermatology       
 4 Opeolu M. A… Associate Professor of Eme… University of Cincinnati Medical Center: (513… emergency medicine, neu… Emergency Medicine
 5 Caleb Adler… Professor in the Departmen… UC Health Psychiatry (Stetson Building): (513… psychiatrypsychology, m… Psychiatry & Beha…
 6 John Adler,… Assistant Professor of Obs… UC Health Women's Center: (513) 475-8248, UC … gynecology, robotic sur… OB/GYN            
 7 Steven S. A… Assistant Professor         UC Health Physicians Office (Clifton - Piedmo… orthopaedics, spine sur… Orthopaedics & Sp…
 8 Surabhi Aga… Assistant Professor of Med… Hoxworth Center: (513) 475-8524, UC Health Ph… rheumatology, connectiv… Internal Medicine 
 9 Saad S. Ahm… Assistant Professor of Med… Hoxworth Center: (513) 584-7217                cardiovascular disease,… Internal Medicine 
10 Syed Ahmad,… Professor of Surgery; Dire… UC Health Barrett Cancer Center: (513) 584-89… surgical oncology, canc… Surgery           
# … with 1,127 more rows
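
The per-node idea can also be tried on a small offline example. The sketch below is a simplified variant, not the answer's exact pipeline: it uses base string functions instead of str_split and bind_cols, and the HTML, provider names, and the helper name node_to_row are hypothetical. It assumes each list item has the form "Label: value".

```r
library(rvest)
library(dplyr)

# Hypothetical HTML; the second result has no Department item.
page <- read_html('
  <div class="searchresult"><h4>Provider One</h4>
    <ul class="propertylist"><li>Locations: Clinic A</li><li>Department: Surgery</li></ul></div>
  <div class="searchresult"><h4>Provider Two</h4>
    <ul class="propertylist"><li>Locations: Clinic B</li></ul></div>')

results <- html_nodes(page, ".searchresult")

# Hypothetical helper: turn one result node into a one-row tibble.
node_to_row <- function(node) {
  items  <- node %>% html_nodes(".propertylist li") %>% html_text(trim = TRUE)
  labels <- trimws(sub(":.*$", "", items))       # text before the first colon
  values <- trimws(sub("^[^:]*:", "", items))    # text after the first colon
  name   <- node %>% html_node("h4") %>% html_text(trim = TRUE)
  as_tibble(c(list(Name = name), stats::setNames(as.list(values), labels)))
}

rows <- lapply(seq_along(results), function(i) node_to_row(results[[i]]))

# bind_rows() matches columns by name and fills absent ones with NA.
stacked <- bind_rows(rows)
print(stacked)
```

Because the columns are matched by name when stacking, a provider that lacks a field simply gets NA in that column instead of shifting the other values out of alignment.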