Reading from HTML using rvest

Date: 2017-10-22 15:12:24

Tags: r web-scraping rvest

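Is it possible to use the rvest package to read the text that follows an input type="radio" tag and is itself followed by a span class="glyphicon glyphicon-ok" tag? For example, I want to read "Carbohydrates and fats" into a character vector.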

R code (does not work; NA is stored in p_ans)

install.packages('rvest')
library('rvest')

url <- 'http://upscfever.com/upsc-fever/en/test/en-test-sci1.html'

webpage <- read_html(url)

p_ans <- webpage %>%
        html_nodes("input + glyphicon-ok") %>%
        html_text()

HTML code

<div class="form-group" id="myform">
            <label for="usr">Q1: Energy giving foods are </label>
     </div>
    <div class="radio">
      <label><input type="radio" value="1" name="optradio0">Carbohydrates and fats<span class="glyphicon glyphicon-ok"></span></label>
    </div>
    <div class="radio">
      <label><input type="radio" id="opt1" value="-0.33" name="optradio0">Carbohydrates and Proteins<span id="sp1" class="glyphicon glyphicon-remove"></span></label>
    </div>

1 answer:

Answer 0 (score: 0)

library(rvest)

pg <- read_html("http://upscfever.com/upsc-fever/en/test/en-test-sci1.html")
html_nodes(pg, xpath=".//label[input and span[contains(@class, 'glyphicon glyphicon-ok')]]") %>% 
  html_text()
##  [1] "Carbohydrates and fats"                                         
##  [2] "saturated fatty acids"                                          
##  [3] "unsaturated fatty acids are good for health"                    
##  [4] "unsaturated fats"                                               
##  [5] "polypeptides"                                                   
##  [6] "Maerasmus"                                                      
##  [7] "Ribulose bisphosphate Carboxylase-Oxygenase "                   
##  [8] "Mercury"                                                        
##  [9] "Cadmium"                                                        
## [10] "Absorb free radicals"                                           
## [11] "A"                                                              
## [12] "Calcium - Goitre"                                               
## [13] "none"                                                           
## [14] "Excretion of undigested food"                                   
## [15] " complex components of food are broken into simpler substances."
## [16] "starch to sugar"                                                
## [17] "protection of stomach lining"                                   
## [18] "Liver"                                                          
## [19] "digestion of fats"                                              
## [20] "only HDC is good"                                               
## [21] "35-42"                                                          
## [22] "absorption of food"                                             
## [23] "digest cellulose"                                               
## [24] "meat is easily digested"                                        
## [25] "gall bladder"
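
Why the XPath works where the selector in the question does not: in the posted markup the option text is a child of the label element, not of the input or the span, and "glyphicon-ok" is a class rather than a tag name. The following is a minimal sketch that can be run offline, assuming the live page is structured like the HTML shown in the question. The inline string `snippet` and the variable names `doc` and `correct` are illustrative and not part of the original answer.

```r
library(rvest)

# Inline copy of the markup from the question, so no network access is needed.
snippet <- '
<div class="radio">
  <label><input type="radio" value="1" name="optradio0">Carbohydrates and fats<span class="glyphicon glyphicon-ok"></span></label>
</div>
<div class="radio">
  <label><input type="radio" id="opt1" value="-0.33" name="optradio0">Carbohydrates and Proteins<span id="sp1" class="glyphicon glyphicon-remove"></span></label>
</div>'

doc <- read_html(snippet)

# The option text belongs to <label>, so select every <label> that contains
# both an <input> and a <span> whose class attribute includes "glyphicon-ok".
correct <- doc %>%
  html_nodes(xpath = ".//label[input and span[contains(@class, 'glyphicon-ok')]]") %>%
  html_text()

print(correct)
```

Only the first label qualifies here, because the second span carries glyphicon-remove instead of glyphicon-ok.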