Problem scraping information from a web page using rvest

Asked: 2019-04-11 13:07:37

Tags: r web-scraping rvest

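I hope you can help. I am trying to extract the "results" table from the following URL: https://www.moneysupermarket.com/credit-cards/search/results/?goal=CC_ALLCARDS

I have been using rvest, following this blog post: https://www.r-bloggers.com/a-text-mining-function-for-websites/

This approach worked when I adapted the code for uSwitch, but I think the MSM (MoneySuperMarket) site is more complex.

Here is my uSwitch code: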

## Load the libraries
library(tidyverse)     # General purpose data wrangling
library(rvest)         # Parsing of html/xml files
library(stringr)       # String manipulation
library(rebus)         # Verbose regular expressions
library(lubridate)     # Eases datetime manipulation

################################################################################

##  BUILD THE BT TABLE

## Define the page to scrape
bt.url <- 'https://www.uswitch.com/credit-cards/credit-card-balance-transfers/'

## Download and parse the page once, then reuse it for every selector
bt.page <- read_html(bt.url)

## Get the top ten brands
bt.brand <- bt.page %>%
     html_nodes(".us-ct-row__title--mobile strong") %>%
     html_text()

## Get the primary offer
bt.primary.offer <- bt.page %>%
     html_nodes(".us-ct-row__col--highlight") %>%
     html_text()

## Get the offer details
bt.offer.details <- bt.page %>%
     html_nodes(".us-ct-row__key-details-col:nth-child(1)") %>%
     html_text()

## Strip the leading "Card details" label from each entry
bt.clean.offer.details <- bt.offer.details %>% 
     str_replace("^Card details*", "")

## Get the cost to the customer
bt.cost.to.cust <- bt.page %>%
     html_nodes(".us-ct-row__name--fee span") %>%
     html_text()

## Create a list of even numbers
even.seq <- seq(2, 20, 2)
## Extract even obs because the £ sign and the value are split into separate rows
bt.cost.to.cust <- bt.cost.to.cust[even.seq]

## Get the APR
bt.apr <- bt.page %>%
     html_nodes(".us-ct-row__col--highlight+ .us-ct-row__col--stretch .us-ct-row__name span") %>%
     html_text()

## Get the offer duration
# .us-ct-row__col--highlight strong
bt.offer.duration <- bt.page %>%
     html_nodes(".us-ct-row__col--highlight strong") %>%
     html_text()

## Stitch it all together
bt.table <- as.matrix(cbind(bt.brand, bt.primary.offer, bt.offer.duration, bt.cost.to.cust, 
                            bt.apr, bt.clean.offer.details))
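
One thing worth guarding against in the stitching step: `cbind()` recycles shorter vectors, so a selector that matches fewer nodes than expected can misalign the table. The following is a minimal sketch, assuming every scraped vector should hold one entry per card. `build_offer_table` is a hypothetical helper (not part of rvest), and the sample values are made up for illustration.

```r
# Hypothetical helper: combine scraped columns only if they line up.
build_offer_table <- function(...) {
  cols <- list(...)
  lens <- vapply(cols, length, integer(1))
  # Refuse to build the table when any column has a different length
  if (length(unique(lens)) != 1) {
    stop("Column lengths differ: ",
         paste(names(cols), lens, sep = " = ", collapse = ", "))
  }
  data.frame(cols, stringsAsFactors = FALSE)
}

# Made-up values standing in for the scraped vectors
demo.table <- build_offer_table(
  brand = c("Card A", "Card B"),
  apr   = c("19.9%", "21.9%")
)
```

In the code above, the same call could take `brand = bt.brand, apr = bt.apr` and so on, which would stop with an error instead of recycling if one selector came back short.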

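This all works the way I want. I would simply like to replicate it for the MoneySuperMarket page above, if possible.

Failing that, I found a post suggesting that I check the DOC or XHR sections under the Network tab of the browser console: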

Scrape data from flash page using rvest

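I did this and can see the results table in the console under Network > XHR > results > Preview, but I cannot pull it back into R. This approach would be the best one, because it provides more information than is actually shown on the page, but after about a week of trial and error I will take anything!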
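
For reference, if the results do arrive as JSON in that XHR response, one route is to request the endpoint directly and parse it with jsonlite instead of rvest. The sketch below assumes exactly that. The request URL has to be copied from the Network tab (it is not known here, so only a placeholder appears), `read_results_json` is a hypothetical helper, and the inline sample JSON and its field names are invented purely to show the parsing step. The real endpoint may also require headers, cookies, or a POST body.

```r
library(jsonlite)

# Hypothetical helper: parse a JSON payload (a URL or a JSON string)
# into R objects, flattening nested records into data frame columns.
read_results_json <- function(src) {
  fromJSON(src, flatten = TRUE)
}

# Invented sample payload; the real structure and field names will differ.
sample.json <- '{"results": [
  {"brand": "Card A", "apr": 19.9},
  {"brand": "Card B", "apr": 21.9}
]}'

parsed <- read_results_json(sample.json)
results.table <- parsed$results   # data frame with one row per card

# With the real request URL copied from the Network tab (placeholder here):
# parsed <- read_results_json("<request URL from the XHR entry>")
```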

0 answers:

No answers yet.