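Unable to extract the complete data using rvest

Time: 2018-06-05 06:23:34

Tags: r web-scraping rvest

I am trying to scrape flight prices from the Expedia website using rvest, with SelectorGadget to find the CSS selectors. Here is my code: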

library(rvest)
library(lubridate)  

url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")

webpage <- read_html(url)

departure_time_data_html <- html_nodes(webpage,'.medium-bold span:nth-child(1)')
departure_time_data <- html_text(departure_time_data_html)
departure_time_data

[1] "11:40am" "7:45am" "6:29am" "6:00am" "5:55am"

On the actual website a single page lists 42 entries, but the code extracts only 5 values. Here is the link to the website:

https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A6%2F10%2F2018TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com

I would be glad to hear from any of you. Thanks.

1 answer:

Answer 0: (score: 0)

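The website stores its data in a JSON string that the browser then parses, so you can extract the information directly from that string. In the page source it sits in the element with the id cachedResultsJson.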
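
As a rough illustration of that idea, the sketch below builds a small mock page and pulls a value back out of it. Everything in the mock is hypothetical (the script tag and its type, the leg_a key, the price field); only the cachedResultsJson id and the "content" field holding a second JSON string are taken from this answer. It shows why the string has to be parsed twice.

```r
library(rvest)
library(jsonlite)

# Hypothetical payload: serialize the inner data first, then wrap that JSON
# text as a plain string in the "content" field of an outer JSON object.
inner <- toJSON(list(legs = list(leg_a = list(price = 120))), auto_unbox = TRUE)
outer <- toJSON(list(content = as.character(inner)), auto_unbox = TRUE)

# Mock page; the real page's markup around the JSON is an assumption here.
mock_html <- paste0(
  '<html><body><script id="cachedResultsJson" type="application/json">',
  outer,
  '</script></body></html>'
)

page <- read_html(mock_html)
json_text <- html_text(html_node(page, '#cachedResultsJson'))

outer_parsed <- fromJSON(json_text)            # first pass: "content" is still a string
inner_parsed <- fromJSON(outer_parsed$content) # second pass: the actual data

inner_parsed$legs$leg_a$price
```

After the first fromJSON() call, content is still a single character string, so the nested data only becomes accessible after the second call.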

library(rvest)
library(jsonlite)
library(purrr)

url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")

webpage <- read_html(url)

departure_time_data_html <- html_node(webpage,'#cachedResultsJson') # select the element holding the JSON string by its id
json_text <- departure_time_data_html %>% html_text() # get the JSON string as text

result <- fromJSON(json_text) # parse the outer JSON string into a list
result1 <- fromJSON(result$content) # result$content is itself a JSON string, so parse it again

result1$legs$`0c46a88d484464ad78b9a0985e80ab4e`$timeline$departureTime # example: departure times for one flight, looked up by its leg id

map(result1$legs,~ .x$timeline$departureTime) # departure times for every flight, using map

Sample result:

> map(result1$legs,~ .x$timeline$departureTime)
$`0c46a88d484464ad78b9a0985e80ab4e`
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 7:05am 1.528632e+12   06/10/18 2018-06-10T07:05:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 9:02am 1.528639e+12   06/10/18 2018-06-10T09:02:00.000-05:00   NA

$`90341ad9782711784a797ffeb22a5e44`
       date dateLongStr   time    dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 5:30pm 1.52867e+12   06/10/18 2018-06-10T17:30:00.000-05:00   NA

$c40e4d757819356926cc693ca1820827
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 7:50pm 1.528678e+12   06/10/18 2018-06-10T19:50:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 9:42pm 1.528685e+12   06/10/18 2018-06-10T21:42:00.000-05:00   NA

$`83d7b1595e668e9c4fa886b164202f37`
       date dateLongStr   time     dateTime travelDate                        isoStr hour
1 6/10/2018 Sun, Jun 10 5:54pm 1.528671e+12   06/10/18 2018-06-10T17:54:00.000-05:00   NA
2      <NA>        <NA>   <NA>           NA       <NA>                          <NA>   NA
3 6/10/2018 Sun, Jun 10 7:45pm 1.528678e+12   06/10/18 2018-06-10T19:45:00.000-05:00   NA
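
The list returned by map() holds one data frame per flight, which can be awkward to work with. One possible way to flatten it is sketched below. The legs object here is a small hand-built stand-in for result1$legs (the ids leg_a and leg_b are hypothetical, and only the time and isoStr columns from the sample result are kept). The sketch also assumes that the first row of each timeline is the flight's initial departure, which the sample result suggests but does not confirm.

```r
library(purrr)

# Hand-built stand-in for result1$legs, using values from the sample result.
legs <- list(
  leg_a = list(timeline = list(departureTime = data.frame(
    time   = c("7:05am", NA, "9:02am"),
    isoStr = c("2018-06-10T07:05:00.000-05:00", NA,
               "2018-06-10T09:02:00.000-05:00"),
    stringsAsFactors = FALSE
  ))),
  leg_b = list(timeline = list(departureTime = data.frame(
    time   = "5:30pm",
    isoStr = "2018-06-10T17:30:00.000-05:00",
    stringsAsFactors = FALSE
  )))
)

# One departure time per flight: the first row of each timeline (assumption).
first_departures <- map_chr(legs, ~ .x$timeline$departureTime$time[1])

# All departure rows in one data frame, tagged with the id of their leg.
per_leg <- imap(legs, ~ data.frame(leg_id = .y, .x$timeline$departureTime,
                                   stringsAsFactors = FALSE))
all_departures <- do.call(rbind, per_leg)

# Drop the all-NA rows that appear between segments in the sample result.
all_departures <- all_departures[!is.na(all_departures$time), ]
rownames(all_departures) <- NULL

first_departures
all_departures
```

With the real data, replacing the stand-in with result1$legs should give one entry in first_departures per flight listed on the page.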