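I want to use rvest to get the GPS coordinates from a specific website. When I run html_nodes() on the page, I get an xml_nodeset (35). I want to reach one specific node, the GPS value [the 24th in the list].
Website: https://www.doz.pl/apteki/a127238-DOZ_Apteka_Dbam_o_Zdrowie
When I run: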
Url %>%
html_node("span") %>%
html_text()
Output: Toggle navigation
I can only reach the first node (Toggle navigation). How do I get to the 24th node?
CSS selector and XPath of the target node:
"body > div.page.has-menu-bottom.relative.sticky-nav > section > div > section >
div > div.col-xs-12.col-sm-12.col-lg-9.col-md-9 > div.panel.margin__top-20 >
ul:nth-child(8) > li:nth-child(1) > span:nth-child(2)"
"/html/body/div[1]/section/div/section/div/div[2]/div[1]/ul[3]/li[1]/span[2]"
library(rvest)
Url <- read_html("https://www.doz.pl/apteki/a127238-DOZ_Apteka_Dbam_o_Zdrowie")
Url %>%
html_nodes("span")
ListOfNodes <- Url %>%
html_nodes("span")
ListOfNodes[1:35]
[1] <span class="sr-only">Toggle navigation</span>
[2] <span class="icon-bar"></span>
[3] <span class="icon-bar"></span>
[4] <span class="icon-bar"></span>
[5] <span class="badge badge-info"></span>
[6] <span class="basket__price">\r\n 0 ...
[7] <span class="icon"> </span>
[8] <span class="allCategoriesLabel">Wszystkie kategorie</span>
[9] <span class="list__definition">Adres apteki</span>
[10] <span>Wolności 40, 84-300 Lębork</span>
[11] <span class="list__definition">Dyżur pn-pt</span>
[12] <span> 07:30-21:30</span>
[13] <span class="list__definition">Dyżur sobota:</span>
[14] <span>08:00-21:00</span>
[15] <span class="list__definition">Dyżur niedziela</span>
[16] <span>08:00-20:00</span>
[17] <span class="list__definition">Telefon:</span>
[18] <span>059 8622766</span>
[19] <span class="list__definition">Email:</span>
[20] <span><a href="mailto:%61%70%74%31%32%37%32%33%38@%64%62%61%6d0%6c..."
[21] <span class="list__definition">Komunikator:</span>
[22] <span>-</span>
[23] <span class="list__definition">GPS:</span>
[24] <span>17:44:47.09|54:32:25.63</span>
[25] <span class="list__definition">Długość:</span>
[26] <span>17.7464132000</span>
[27] <span class="list__definition">Szerokość:</span>
[28] <span>54.5404538000</span>
[29] <span class="benefit__icon">\r\n <img src="/assets/doz/images/icons/pa ...
[30] <span class="benefit__icon">\r\n <img src="/assets/doz/images/icons/pr ...
[31] <span class="benefit__icon">\r\n <img src="/assets/doz/images/icons/de ...
[32] <span class="benefit__icon">\r\n <img src="/assets/doz/images/icons/ex ...
[33] <span class="cookie__message">\r\n Ważne: Użytkowanie Witryny oznacza zgod ...
[34] <span>Infolinia:</span>
[35] <span>Infolinia:</span>
Answer 0 (score: 1)
What you did in your code is right; you only need to extract the 24th element from the node set:
url <- "https://www.doz.pl/apteki/a127238-DOZ_Apteka_Dbam_o_Zdrowie"
read_html(url) %>%
html_nodes("span") %>%
'[['(24) %>%
html_text()
[1] "17:44:47.09|54:32:25.63"
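As a small sketch of the same idea without hitting the network, the snippet below indexes into a node set built from an inline HTML fragment. The fragment is a hypothetical stand-in for the real page (only two label/value pairs, so the GPS value is the 4th span here rather than the 24th), and the three forms shown should be interchangeable ways of taking one element out of the set.

```r
library(rvest)

# Hypothetical stand-in for the real page: label/value <span> pairs.
page <- read_html('
  <html><body><ul>
    <li><span class="list__definition">Telefon:</span><span>059 8622766</span></li>
    <li><span class="list__definition">GPS:</span><span>17:44:47.09|54:32:25.63</span></li>
  </ul></body></html>')

spans <- html_nodes(page, "span")

# Plain indexing with [[ ]] returns a single node from the node set.
gps_plain <- html_text(spans[[4]])

# Inside a pipe, the dot placeholder stands for the node set.
gps_dot <- spans %>% .[[4]] %>% html_text()

# magrittr::extract2() is a named alias for `[[`.
gps_alias <- spans %>% magrittr::extract2(4) %>% html_text()
```

All three variables should hold the same string. Note that a fixed index breaks as soon as the page adds or removes a span before the target.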
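To identify the correct node without hard-coding the index, assuming the value always comes right after the span with the text "GPS:", you can use Position():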
# ListOfNodes is the node set from the question: Url %>% html_nodes("span")
pos <- Position(x = ListOfNodes, f = function(x){ html_text(x) == 'GPS:' }) + 1
The piped version looks a bit ugly, but it works:
read_html(url) %>%
html_nodes("span")%>%
'[['(Position(x = ., f = function(x){ html_text(x)=='GPS:'}) + 1) %>%
html_text()
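An alternative to the Position() lookup, offered only as a sketch, is to let XPath find the span that follows the "GPS:" label. This assumes the label and the value are sibling spans inside the same parent, which the XPath in the question (li[1]/span[2]) suggests but does not guarantee. The inline HTML below is a hypothetical stand-in for the real page.

```r
library(rvest)

# Hypothetical stand-in for the real page: label/value <span> pairs.
page <- read_html('
  <html><body><ul>
    <li><span class="list__definition">Telefon:</span><span>059 8622766</span></li>
    <li><span class="list__definition">GPS:</span><span>17:44:47.09|54:32:25.63</span></li>
  </ul></body></html>')

# Select the first <span> sibling that follows the span whose text is "GPS:".
gps <- page %>%
  html_node(xpath = "//span[normalize-space(.)='GPS:']/following-sibling::span[1]") %>%
  html_text()

# Split the "first|second" string into its two parts.
parts <- strsplit(gps, "|", fixed = TRUE)[[1]]
```

Here `gps` holds the raw string and `parts` its two components. If the label is missing, html_node() returns a missing node and html_text() gives NA, so the result is worth checking before further use.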