Reading PhantomJS-rendered HTML into R

Asked: 2017-08-03 22:02:22

Tags: javascript r phantomjs rvest

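Question: Using rvest, I can't seem to find the block of information I need in an HTML page that I rendered with PhantomJS. I have tried nearly every possible format, but I can't get html_node to pick up the right block.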

The PhantomJS-rendered HTML:

<div class="page">

<div class="main-header">    
</script>

    <div id="listing-703036966" class="shop-srp-listings__listing">
        <div class="card listing-row--search hide-fade">

            <div class="listing-row__main">
                <div class="listing-row__image">

                    <div class="media-count shadowed">
                        <a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
                            25 Photos
                        </a>

                            <a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--video" data-goto-vdp="703036966" data-standard-link="md-thumb">
                                1 Video
                            </a>
                    </div>

                    <a href="/vehicledetail/detail/703036966/overview/" target="_self" class="gray-bg listing-row__photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
                        <img alt="New 2018 BMW 750 i" src="https://www.cstatic-images.com/phototab/e/1/4/e2/f87fb57ec51cab4f57cbaeb9f9f.jpg" onload="window.performance.mark('serverSideFirstPhotoLoaded')">
                    </a>
                    <div class="compare-srp">
                        <div class="listing-row__save">
                            <a id="703036966" class="switch-favorite unsaved saveVehicleHeart  compare-switch-favorite" savedfeatureinstance="" vehicle="{&quot;listingId&quot;:703036966,&quot;mkId&quot;:20005,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;mdId&quot;:20536,&quot;mdNm&quot;:&quot;750&quot;,&quot;trimId&quot;:25905,&quot;trimName&quot;:&quot;i&quot;,&quot;modelYearId&quot;:35797618,&quot;modelYear&quot;:2018,&quot;stkTyp&quot;:&quot;New&quot;,&quot;state&quot;:&quot;NC&quot;,&quot;zipcode&quot;:&quot;27107&quot;}" cars-common-omniture-custom="" omniture-events="">
                                <div class="save-icon-wrapper">
                                    <div class="cui-icon icon-heart-line">
                                        <svg width="16" height="16" class="icon-image">
                                            <use xlink:href="#cui-icon-heart-outline"></use>
                                        </svg>
                                    </div>

                                    <div class="cui-icon icon-heart">
                                        <svg width="16" height="16" class="icon-image">
                                            <use xlink:href="#cui-icon-heart-fill"></use>
                                        </svg>
                                    </div>
                                </div>

                                <p class="saved-label">Save</p>
                            </a>
                        </div>
                        <div class="compare-button" data-compare-listing="703036966">
                            <div class="compare-icon-wrapper">
                                <div class="cui-icon icon-plus-sign">
                                    <svg width="16" height="16" class="icon-plus-sign">
                                        <use xlink:href="#cui-icon-plus-sign"></use>
                                    </svg>
                                </div>
                                <div class="cui-icon icon-checkmark">
                                    <svg width="16" height="16" class="icon-checkmark">
                                        <use xlink:href="#cui-icon-checkmark"></use>
                                    </svg>
                                </div>
                            </div>
                            <p class="compare-button__label compare">Compare</p>
                            <p class="compare-button__label added">Added</p>
                        </div>
                    </div>
                </div>

What I did in R:

library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
library(tidyverse)

cars <- read_html("my file.html") %>%
    html_nodes("div") %>%
    html_text()

However, when I inspect the cars vector, the block I need is missing entirely:

<a id="703036966" class="switch-favorite unsaved saveVehicleHeart         compare-switch-favorite" savedfeatureinstance="" vehicle=".   {&quot;listingId&quot;:703036966,&quot;mkId&quot;:20005,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;mdId&quot;:20536,&quot;mdNm&quot;:&quot;750&quot;,&quot;trimId&quot;:25905,&quot;trimName&quot;:&quot;i&quot;,&quot;modelYearId&quot;:35797618,&quot;modelYear&quot;:2018,&quot;stkTyp&quot;:&quot;New&quot;,&quot;state&quot;:&quot;NC&quot;,&quot;zipcode&quot;:&quot;27107&quot;}" cars-common-omniture-custom="" omniture-events="">

It never gets converted into a usable form, and every node type I have tried (div, p, span) loses it.

Any ideas?

1 answer:

Answer 0: (score: 1)

It looks like you want to parse the contents of the braces from a single node, i.e. the string "vehicle='{"listingId":703036966,..." from the node with the CSS path "id.703036966 saveVehicleHeart".

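Because this data lives in the node's vehicle attribute rather than in text that a browser would render, html_text() will never return it. Instead, you can store the node's markup as a string and then parse out the part you are interested in.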
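To illustrate the point, here is a minimal sketch on an inline snippet. The snippet is a hypothetical, heavily trimmed version of the question's markup (an assumption made so the example runs without "my file.html"); the names `snippet`, `node`, `txt`, and `markup` are illustrative only.

```r
library(rvest)

# Hypothetical, trimmed-down version of the <a> node from the question.
snippet <- '<a class="saveVehicleHeart" vehicle="{&quot;listingId&quot;:703036966,&quot;mkNm&quot;:&quot;BMW&quot;}"><p class="saved-label">Save</p></a>'

node <- html_node(read_html(snippet), ".saveVehicleHeart")

# html_text() only sees text content, so the attribute data is absent.
txt <- html_text(node)

# as.character() serializes the whole node, attributes included.
markup <- as.character(node)

grepl("listingId", txt, fixed = TRUE)     # attribute data is not in the text
grepl("listingId", markup, fixed = TRUE)  # but it is in the serialized markup
```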

1. Retrieve the node as a string. One of several possible CSS paths to the node is '.saveVehicleHeart'.

library(rvest)
library(stringr)
library(dplyr)
library(tidyr)  # provides separate(), used in step 3
car_html <- read_html("my file.html")
cars <- as.character(html_node(car_html, css = '.saveVehicleHeart'))

2. Extract the content inside the braces "{}".

cars <- cars %>%
   str_match(., "\\{.*?\\}") %>% ## Extract everything between the first "{" and the subsequent "}"
   gsub("\\{|\\}", "", .) ## Remove the characters "{" and "}"
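As a small sketch of what the pattern in step 2 does, the same two calls can be run on literal strings. The inputs `s` and `nested` below are made up for illustration. Note that the non-greedy `.*?` stops at the first "}", which works for a flat object like the one in the question but would cut a nested object short.

```r
library(stringr)

# Hypothetical serialized node, shortened for illustration.
s <- "<a vehicle='{\"listingId\":703036966,\"mkNm\":\"BMW\"}' omniture-events=''>"

# str_match() returns a matrix; column 1 holds the full match.
m <- str_match(s, "\\{.*?\\}")[, 1]

# Strip the surrounding braces.
inner <- gsub("\\{|\\}", "", m)

# Caveat: with nested braces the lazy match ends at the first "}".
nested <- "{\"a\":{\"b\":1},\"c\":2}"
m_nested <- str_match(nested, "\\{.*?\\}")[, 1]
```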

3. Bonus. Put it into a tidy data frame. You didn't ask for this, but it may be helpful.

df_cars <- cars %>% 
   cbind(read.table(text = ., sep = (','))) %>%
   t() %>% 
   as_data_frame() %>%
   .[-1,] %>% ## The first row contains the original unparsed string. We drop it.
   separate(., V1, into = c("Variable", "Value"), sep = "\\:")
df_cars

# A tibble: 12 × 2
      Variable     Value
*        <chr>     <chr>
1    listingId 703036966
2         mkId     20005
3         mkNm       BMW
4         mdId     20536
5         mdNm       750
6       trimId     25905
7     trimName         i
8  modelYearId  35797618
9    modelYear      2018
10      stkTyp       New
11       state        NC
12     zipcode     27107
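As an alternative to the string parsing above (a different technique, not the one this answer uses), the attribute could probably be read directly with html_attr() and parsed as JSON. This sketch assumes the jsonlite package is installed and that the attribute value is valid JSON, which appears to be the case for the node in the question. The inline `snippet` is a hypothetical, shortened stand-in for "my file.html", and `vehicle_json`, `vehicle`, and `df_vehicle` are illustrative names.

```r
library(rvest)
library(jsonlite)

# Hypothetical, shortened stand-in for the node in "my file.html".
snippet <- '<a class="saveVehicleHeart" vehicle="{&quot;listingId&quot;:703036966,&quot;mkNm&quot;:&quot;BMW&quot;,&quot;modelYear&quot;:2018,&quot;state&quot;:&quot;NC&quot;}"></a>'

# html_attr() returns the attribute value with entities such as &quot; decoded.
vehicle_json <- read_html(snippet) %>%
  html_node(".saveVehicleHeart") %>%
  html_attr("vehicle")

# Parse the JSON object into a named list.
vehicle <- fromJSON(vehicle_json)

# Reshape into the same Variable / Value layout as the table above.
df_vehicle <- data.frame(
  Variable = names(vehicle),
  Value = unname(vapply(vehicle, as.character, character(1))),
  stringsAsFactors = FALSE
)
df_vehicle
```

Parsing the attribute as JSON avoids splitting on "," and ":" by hand, so values that themselves contain commas or colons would not be broken apart.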