How to extract a specific node from a multi-node XML output

Asked: 2017-02-08 10:52:37

Tags: r web-scraping rvest

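I want to use rvest to get the GPS coordinates from a particular website. When I run html_nodes() on the page, I get an xml_nodeset (35). I want to reach the specific node that holds the GPS value [the 24th in the list].

Website: https://www.doz.pl/apteki/a127238-DOZ_Apteka_Dbam_o_Zdrowie

When I run: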

Url %>% 
  html_node("span") %>%
  html_text()

Output: Toggle navigation

I can only reach the first node (Toggle navigation). How do I reach the 24th node?

Output of "Copy selector":

"body > div.page.has-menu-bottom.relative.sticky-nav > section > div > section >
div > div.col-xs-12.col-sm-12.col-lg-9.col-md-9 > div.panel.margin__top-20 >
ul:nth-child(8) > li:nth-child(1) > span:nth-child(2)"

Output of "Copy XPath":

"/html/body/div[1]/section/div/section/div/div[2]/div[1]/ul[3]/li[1]/span[2]"

Code

library(rvest)
Url <- read_html("https://www.doz.pl/apteki/a127238-DOZ_Apteka_Dbam_o_Zdrowie")

Url %>% 
  html_nodes("span") 


ListOfNodes <- Url %>% 
  html_nodes("span") 

ListOfNodes[1:35]

 [1] <span class="sr-only">Toggle navigation</span>
 [2] <span class="icon-bar"></span>
 [3] <span class="icon-bar"></span>
 [4] <span class="icon-bar"></span>
 [5] <span class="badge badge-info"></span>
 [6] <span class="basket__price">\r\n                                                            0 ...
 [7] <span class="icon"> </span>
 [8] <span class="allCategoriesLabel">Wszystkie kategorie</span>
 [9] <span class="list__definition">Adres apteki</span>
[10] <span>Wolności 40, 84-300 Lębork</span>
[11] <span class="list__definition">Dyżur pn-pt</span>
[12] <span> 07:30-21:30</span>
[13] <span class="list__definition">Dyżur sobota:</span>
[14] <span>08:00-21:00</span>
[15] <span class="list__definition">Dyżur niedziela</span>
[16] <span>08:00-20:00</span>
[17] <span class="list__definition">Telefon:</span>
[18] <span>059 8622766</span>
[19] <span class="list__definition">Email:</span>
[20] <span><a href="mailto:%61%70%74%31%32%37%32%33%38@%64%62%61%6d0%6c..."
[21] <span class="list__definition">Komunikator:</span>
[22] <span>-</span>
[23] <span class="list__definition">GPS:</span>
[24] <span>17:44:47.09|54:32:25.63</span>
[25] <span class="list__definition">Długość:</span>
[26] <span>17.7464132000</span>
[27] <span class="list__definition">Szerokość:</span>
[28] <span>54.5404538000</span>
[29] <span class="benefit__icon">\r\n                        <img src="/assets/doz/images/icons/pa ...
[30] <span class="benefit__icon">\r\n                        <img src="/assets/doz/images/icons/pr ...
[31] <span class="benefit__icon">\r\n                        <img src="/assets/doz/images/icons/de ...
[32] <span class="benefit__icon">\r\n                        <img src="/assets/doz/images/icons/ex ...
[33] <span class="cookie__message">\r\n                    Ważne: Użytkowanie Witryny oznacza zgod ...
[34] <span>Infolinia:</span>
[35] <span>Infolinia:</span>

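1 answer:

Answer 0 (score: 1)

What you did in the Code section is right. html_node() returns only the first match, whereas html_nodes() returns all of them, so you just need to extract the 24th element from the list: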

library(rvest)

url <- "https://www.doz.pl/apteki/a127238-DOZ_Apteka_Dbam_o_Zdrowie"
read_html(url) %>%
    html_nodes("span") %>%   # all 35 <span> nodes
    `[[`(24) %>%             # pick the 24th node
    html_text()

[1] "17:44:47.09|54:32:25.63"
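The difference between html_node() and html_nodes() can be seen without hitting the network. The snippet below is only a sketch: the inline HTML is a made-up, much smaller stand-in for the real page, and the names page, first_span, all_spans and third_text are illustrative.

```r
library(rvest)

# Hypothetical miniature of the page: three <span> elements in total
page <- read_html('<ul>
  <li><span class="sr-only">Toggle navigation</span></li>
  <li><span class="list__definition">GPS:</span><span>17:44:47.09|54:32:25.63</span></li>
</ul>')

# html_node() stops at the first matching element
first_span <- page %>% html_node("span") %>% html_text()

# html_nodes() returns every match as an xml_nodeset
all_spans <- page %>% html_nodes("span")

# [[ ]] picks a single node out of the nodeset by position
third_text <- all_spans[[3]] %>% html_text()
```

Here first_span holds the "Toggle navigation" text, which is the behaviour described in the question, while all_spans[[3]] reaches the GPS value by position, just as [[24]] does on the real page.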

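To identify the correct node without hard-coding its index, and assuming the value always comes right after the span with the text "GPS:", you can use Position() to find the label and add 1. With the ListOfNodes object from the question:

pos <- Position(x = ListOfNodes, f = function(x){ html_text(x)=='GPS:'}) + 1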
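Position() returns NA when no element satisfies the predicate, so a page without a "GPS:" label would make the lookup fail at the indexing step. A minimal sketch of a guarded version follows; the inline HTML and the names page, spans, label_pos and gps are assumptions made for illustration, not part of the original answer.

```r
library(rvest)

# Hypothetical miniature of the page
page <- read_html('<ul>
  <li><span class="sr-only">Toggle navigation</span></li>
  <li><span class="list__definition">GPS:</span><span>17:44:47.09|54:32:25.63</span></li>
</ul>')
spans <- html_nodes(page, "span")

# Index of the first span whose text is exactly "GPS:", or NA if there is none
label_pos <- Position(function(node) html_text(node) == "GPS:", spans)

# Only index into the nodeset when the label exists and is not the last span
gps <- if (is.na(label_pos) || label_pos == length(spans)) {
  NA_character_
} else {
  html_text(spans[[label_pos + 1]])
}
```

On this small document label_pos is 2, so the value is read from the third span.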

Written as a single pipe it looks a bit ugly, but it works:

read_html(url) %>% 
    html_nodes("span")%>% 
    '[['(Position(x = ., f = function(x){ html_text(x)=='GPS:'}) + 1) %>% 
    html_text()
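Another option, not taken from the answer above, is to let XPath do the "label, then next span" lookup. This is a sketch that assumes the label and the value are sibling span elements under the same parent, as the copied selector (li:nth-child(1) > span:nth-child(2)) suggests; the inline HTML is a made-up stand-in for the real page.

```r
library(rvest)

# Hypothetical miniature of the page
page <- read_html('<ul>
  <li><span class="list__definition">Telefon:</span><span>059 8622766</span></li>
  <li><span class="list__definition">GPS:</span><span>17:44:47.09|54:32:25.63</span></li>
</ul>')

# Find the span whose trimmed text is "GPS:", then take its next sibling span
gps <- page %>%
  html_node(xpath = "//span[normalize-space(.) = 'GPS:']/following-sibling::span[1]") %>%
  html_text()
```

Because the expression keys on the label text rather than on a position in the nodeset, it does not depend on how many span elements precede the GPS entry.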