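Question: Using rvest, I can't seem to find the block of information I need in an HTML page that I rendered with PhantomJS. I've tried nearly every selector I can think of, but I can't get html_node to grab the right block.
PhantomJS-rendered HTML: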
<div class="page">
<div class="main-header">
</script>
<div id="listing-703036966" class="shop-srp-listings__listing">
<div class="card listing-row--search hide-fade">
<div class="listing-row__main">
<div class="listing-row__image">
<div class="media-count shadowed">
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
25 Photos
</a>
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="media-count--video" data-goto-vdp="703036966" data-standard-link="md-thumb">
1 Video
</a>
</div>
<a href="/vehicledetail/detail/703036966/overview/" target="_self" class="gray-bg listing-row__photo" data-goto-vdp="703036966" data-standard-link="md-thumb">
<img alt="New 2018 BMW 750 i" src="https://www.cstatic-images.com/phototab/e/1/4/e2/f87fb57ec51cab4f57cbaeb9f9f.jpg" onload="window.performance.mark('serverSideFirstPhotoLoaded')">
</a>
<div class="compare-srp">
<div class="listing-row__save">
<a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle="{"listingId":703036966,"mkId":20005,"mkNm":"BMW","mdId":20536,"mdNm":"750","trimId":25905,"trimName":"i","modelYearId":35797618,"modelYear":2018,"stkTyp":"New","state":"NC","zipcode":"27107"}" cars-common-omniture-custom="" omniture-events="">
<div class="save-icon-wrapper">
<div class="cui-icon icon-heart-line">
<svg width="16" height="16" class="icon-image">
<use xlink:href="#cui-icon-heart-outline"></use>
</svg>
</div>
<div class="cui-icon icon-heart">
<svg width="16" height="16" class="icon-image">
<use xlink:href="#cui-icon-heart-fill"></use>
</svg>
</div>
</div>
<p class="saved-label">Save</p>
</a>
</div>
<div class="compare-button" data-compare-listing="703036966">
<div class="compare-icon-wrapper">
<div class="cui-icon icon-plus-sign">
<svg width="16" height="16" class="icon-plus-sign">
<use xlink:href="#cui-icon-plus-sign"></use>
</svg>
</div>
<div class="cui-icon icon-checkmark">
<svg width="16" height="16" class="icon-checkmark">
<use xlink:href="#cui-icon-checkmark"></use>
</svg>
</div>
</div>
<p class="compare-button__label compare">Compare</p>
<p class="compare-button__label added">Added</p>
</div>
</div>
</div>
etc.
What I did in R:
library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
library(tidyverse)
cars <- read_html("my file.html") %>%
html_nodes("div") %>%
html_text()
However, when I inspect the cars vector, the block I need is missing entirely:
<a id="703036966" class="switch-favorite unsaved saveVehicleHeart compare-switch-favorite" savedfeatureinstance="" vehicle="{"listingId":703036966,"mkId":20005,"mkNm":"BMW","mdId":20536,"mdNm":"750","trimId":25905,"trimName":"i","modelYearId":35797618,"modelYear":2018,"stkTyp":"New","state":"NC","zipcode":"27107"}" cars-common-omniture-custom="" omniture-events="">
It never gets converted into a usable form, and every node type I tried (div, p, span) loses it.
Any ideas?
Answer 0 (score: 1)
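It seems you want to parse the content inside the braces from a single node, i.e. the string "vehicle='{"listingId":703036966,..." from the node with CSS path "id.703036966 saveVehicleHeart".
The data you want is stored in an attribute of the node, not in text that a browser would render, so html_text() will not return it. Instead, you can store the node's markup as a string and then parse the part of interest.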
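To make the difference concrete, here is a minimal sketch that compares html_text() with html_attr() on a single node. The `snippet` string is a hypothetical, shortened stand-in for the rendered page, and it assumes the attribute value is quoted so that the inner double quotes survive parsing (in a real file they would normally be escaped as entities). Reading the attribute with html_attr() is a possible alternative to working on the node's markup, not the approach used in the steps that follow.

```r
library(rvest)

# Hypothetical one-node stand-in for the rendered page. The attribute value is
# wrapped in single quotes so the inner double quotes are parsed correctly.
snippet <- paste0(
  "<div class='listing-row__save'>",
  "<a id='703036966' class='switch-favorite saveVehicleHeart' ",
  "vehicle='{\"listingId\":703036966,\"mkNm\":\"BMW\",\"mdNm\":\"750\"}'>",
  "Save</a></div>"
)

page <- read_html(snippet)
node <- html_node(page, ".saveVehicleHeart")

# html_text() returns only the rendered text of the node (the "Save" label).
visible_text <- html_text(node)

# html_attr() returns the value of the named attribute as a string.
vehicle_json <- html_attr(node, "vehicle")
```

If the attribute in your file is intact, `vehicle_json` already holds the braces and their content, so no string extraction from the markup is needed.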
1. Retrieve the node's markup as a string. One of several possible CSS paths to the node is '.saveVehicleHeart':
library(rvest)
library(stringr)
library(dplyr)
library(tidyr) ## Provides separate(), which is used in step 3
car_html <- read_html("my file.html")
cars <- as.character(html_node(car_html, css = '.saveVehicleHeart'))
2. Extract the content inside the braces "{}":
cars <- cars %>%
str_match(., "\\{.*?\\}") %>% ## Extract everything between the first "{" and the subsequent "}"
gsub("\\{|\\}", "", .) ## Remove the characters "{" and "}"
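As a small variation on the step above, the sketch below uses a capture group so that the braces are dropped by the match itself and no gsub() call is needed. The `node_string` value is a hypothetical, shortened node string, not the real output of as.character(). Note that the lazy quantifier `.*?` stops at the first "}", so this only works while the content has no nested braces.

```r
library(stringr)

# Hypothetical, shortened node string used only for illustration.
node_string <- "<a vehicle='{\"listingId\":703036966,\"mkNm\":\"BMW\"}' omniture-events=''>"

# Column 1 of the str_match() result is the full match (with braces);
# column 2 is the first capture group (the text between the braces).
inner <- str_match(node_string, "\\{(.*?)\\}")[, 2]
```
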
3. Bonus: put it into a nice data frame. You didn't ask for this, but it may be helpful.
df_cars <- cars %>%
cbind(read.table(text = ., sep = (','))) %>%
t() %>%
as_data_frame() %>%
.[-1,] %>% ## The first row contains the original unparsed string. We drop it.
separate(., V1, into = c("Variable", "Value"), sep = "\\:")
df_cars
# A tibble: 12 × 2
Variable Value
* <chr> <chr>
1 listingId 703036966
2 mkId 20005
3 mkNm BMW
4 mdId 20536
5 mdNm 750
6 trimId 25905
7 trimName i
8 modelYearId 35797618
9 modelYear 2018
10 stkTyp New
11 state NC
12 zipcode 27107
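Because the content between the braces is JSON, another option is to hand it to a JSON parser instead of splitting on commas and colons. The sketch below swaps in the jsonlite package, which the answer above does not use; the `vehicle_json` string is a shortened, assumed input that keeps the surrounding braces.

```r
library(jsonlite)

# Assumed input: the raw attribute value, braces included (shortened here).
vehicle_json <- '{"listingId":703036966,"mkNm":"BMW","mdNm":"750","modelYear":2018,"zipcode":"27107"}'

# fromJSON() turns a flat JSON object into a named list.
vehicle <- fromJSON(vehicle_json)

# Build a two-column data frame of names and values, with every value
# converted to character so the column has a single type.
df_vehicle <- data.frame(
  Variable = names(vehicle),
  Value = unname(vapply(vehicle, as.character, character(1))),
  stringsAsFactors = FALSE
)
```

A parser also copes with values that contain commas or colons, which would break the split-based approach.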