Using rvest to scrape <a href="">'s in <svg>'s

Time: 2018-02-09 22:42:42

Tags: html r svg web-scraping rvest

I'm trying to scrape all the links on an SVG map. I'm sorry I can't post the link, because you'd need a login, but the HTML file goes like this:

...
<div class = 'header'>
    <div class = 'headermenu'>
    ## only other place where <a href> tags exist
        <a href = 'link_is_here'>...</a>
        ## four more of them, separated with non-breaking spaces
    </div>
</div>
<div class = 'main'></div>
<svg parameters_of_graphic>
    ## paths for images
    <a href='link_is_here' parameters_of_link>
        ## path for above link
    </a>
    ## paths for images
    <a href='link_is_here' parameters_of_link>
        <circle parameters_of_circle></circle>
    </a>
    ## multiple circle links of same format
</svg>
...

However, when I use home_url %>% read_html() %>% html_nodes('a'), I only get the five nodes under the header class. I searched for ways to scrape svg content with rvest, but I couldn't find any way of scraping the nodes under the svg tag. Is there any way to do this in R?

1 answer:

Answer 0: (score: 0)

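I can't be sure without seeing the actual web page, but I suspect the svg is generated dynamically with JavaScript. read_html() does not run JavaScript on the page, so those links may not exist at the time the page is read.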

You should be able to inspect what read_html() returns to confirm this.
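
A minimal sketch of that check. The real page is behind a login, so `static_html` here is a hypothetical stand-in that copies the structure from the question, and the link values are made up. It shows that when the svg markup is already in the HTML, `html_nodes()` does find the links inside it. If the same count comes back as zero on the real page, that would be consistent with the svg being inserted by JavaScript after load.

```r
library(rvest)

# Hypothetical stand-in for the real page (which needs a login).
# The structure follows the question; the href values are invented.
static_html <- paste0(
  "<div class='header'><div class='headermenu'>",
  "<a href='header_link'>Home</a>",
  "</div></div>",
  "<div class='main'></div>",
  "<svg>",
  "<a href='svg_link'><circle r='5'></circle></a>",
  "</svg>"
)

page <- read_html(static_html)

# How many <svg> elements does the parsed document contain?
# Zero on the real page would suggest the svg is not in the served HTML.
svg_count <- length(html_nodes(page, "svg"))

# Links anywhere in the document, and links inside the svg only
all_links <- page %>% html_nodes("a") %>% html_attr("href")
svg_links <- page %>% html_nodes("svg a") %>% html_attr("href")

print(svg_count)
print(all_links)
print(svg_links)
```

To run the same check against the real page, replace `static_html` with the result of the original `read_html(home_url)` call. Printing `as.character(page)` and searching it for `<svg` is another way to see whether the markup was served at all.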