I am trying to scrape flight prices from the Expedia website using rvest, with SelectorGadget to find the CSS selectors. Here is my code:
library(rvest)
library(lubridate)
url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")
webpage <- read_html(url)
departure_time_data_html <- html_nodes(webpage,'.medium-bold span:nth-child(1)')
departure_time_data <- html_text(departure_time_data_html)
departure_time_data
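On the actual website there are 42 entries on a single page, but the code extracts only 5 values. Here is the website link:
I would be glad to hear from any of you. Thanks.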
Answer 0 (score: 0)
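The website stores the data in a JSON string that the browser parses, so you can extract the information directly from that JSON string. (Look at the page source to see it.)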
library(rvest)
library(jsonlite)
library(purrr)
url <- paste('https://www.expedia.com/Flights-Search?trip=oneway&leg1=from%3AAustin%2C%20TX%2C%20United%20States%20(AUS)%2Cto%3ASan%20Francisco%2C%20CA%2C%20United%20States%20of%20America%20(SFO)%2Cdeparture%3A', 06,'%2F', 10,'%2F',2018,'TANYT&passengers=adults%3A1%2Cchildren%3A0%2Cseniors%3A0%2Cinfantinlap%3AY&options=cabinclass%3Aeconomy&mode=search&origref=www.expedia.com', sep = "")
webpage <- read_html(url)
departure_time_data_html <- html_node(webpage,'#cachedResultsJson') # node whose id holds the JSON string
json_text <- departure_time_data_html %>% html_text() # get the JSON string as text
result <- fromJSON(json_text) # parse the outer JSON string into a list
result1 <- fromJSON(result$content) # the `content` field is itself a JSON string, so parse it again
result1$legs$`0c46a88d484464ad78b9a0985e80ab4e`$timeline$departureTime # example: departure times for a single flight
map(result1$legs,~ .x$timeline$departureTime) # extract the departure times of every flight using map
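The two `fromJSON` calls are needed because the JSON is nested: the outer object holds the flight data as a string in `content`. The following is a minimal offline sketch of that double decoding, not the real Expedia payload. The `<script>` wrapper, the leg names `leg_a`/`leg_b`, and the flattened `departureTime` field are assumptions made up for illustration; only the `#cachedResultsJson` id and the `content`/`legs` names come from the code above.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the flight data; the real payload is far larger.
inner <- toJSON(
  list(legs = list(
    leg_a = list(departureTime = "7:05am"),
    leg_b = list(departureTime = "5:30pm")
  )),
  auto_unbox = TRUE
)

# The outer object carries the inner JSON as a plain string in `content`.
outer <- toJSON(list(content = as.character(inner)), auto_unbox = TRUE)

# Embed the outer JSON in a node with the id used above
# (assumption: the real page uses a similar element).
page <- read_html(paste0(
  '<html><body><script id="cachedResultsJson" type="application/json">',
  outer,
  '</script></body></html>'
))

json_text <- html_text(html_node(page, "#cachedResultsJson"))
result  <- fromJSON(json_text)       # first pass: outer object
result1 <- fromJSON(result$content)  # second pass: the nested JSON string

# Base-R equivalent of the map() call: one departure time per leg.
departure_times <- vapply(
  result1$legs,
  function(leg) leg$departureTime,
  character(1)
)
print(departure_times)
```

After the first pass, `result$content` is still a character string; only the second pass turns it into a list that can be indexed with `$legs`.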
Sample result:
> map(result1$legs,~ .x$timeline$departureTime)
$`0c46a88d484464ad78b9a0985e80ab4e`
date dateLongStr time dateTime travelDate isoStr hour
1 6/10/2018 Sun, Jun 10 7:05am 1.528632e+12 06/10/18 2018-06-10T07:05:00.000-05:00 NA
2 <NA> <NA> <NA> NA <NA> <NA> NA
3 6/10/2018 Sun, Jun 10 9:02am 1.528639e+12 06/10/18 2018-06-10T09:02:00.000-05:00 NA
$`90341ad9782711784a797ffeb22a5e44`
date dateLongStr time dateTime travelDate isoStr hour
1 6/10/2018 Sun, Jun 10 5:30pm 1.52867e+12 06/10/18 2018-06-10T17:30:00.000-05:00 NA
$c40e4d757819356926cc693ca1820827
date dateLongStr time dateTime travelDate isoStr hour
1 6/10/2018 Sun, Jun 10 7:50pm 1.528678e+12 06/10/18 2018-06-10T19:50:00.000-05:00 NA
2 <NA> <NA> <NA> NA <NA> <NA> NA
3 6/10/2018 Sun, Jun 10 9:42pm 1.528685e+12 06/10/18 2018-06-10T21:42:00.000-05:00 NA
$`83d7b1595e668e9c4fa886b164202f37`
date dateLongStr time dateTime travelDate isoStr hour
1 6/10/2018 Sun, Jun 10 5:54pm 1.528671e+12 06/10/18 2018-06-10T17:54:00.000-05:00 NA
2 <NA> <NA> <NA> NA <NA> <NA> NA
3 6/10/2018 Sun, Jun 10 7:45pm 1.528678e+12 06/10/18 2018-06-10T19:45:00.000-05:00 NA
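The result above is a list with one data frame per leg, which is awkward to work with further. As a rough sketch, the list can be collapsed into a single data frame with base R. The `departures` list below is hypothetical data shaped like the sample output (only the `time` and `dateTime` columns are reproduced), and two things are assumptions: that `dateTime` is a Unix timestamp in milliseconds (it is consistent with the `isoStr` values shown) and that the all-`NA` rows can be dropped safely.

```r
# Hypothetical stand-in shaped like the output of the map() call above.
departures <- list(
  leg_a = data.frame(
    time = c("7:05am", NA, "9:02am"),
    dateTime = c(1528632300000, NA, 1528639320000),
    stringsAsFactors = FALSE
  ),
  leg_b = data.frame(
    time = "5:30pm",
    dateTime = 1528669800000,
    stringsAsFactors = FALSE
  )
)

# Stack the per-leg data frames, keeping the leg id as a column.
flat <- do.call(rbind, lapply(names(departures), function(id) {
  data.frame(leg_id = id, departures[[id]], stringsAsFactors = FALSE)
}))

# Drop the rows that carry no departure time (assumption: they are not needed).
flat <- flat[!is.na(flat$time), ]
rownames(flat) <- NULL

# Assumption: dateTime is epoch milliseconds, so divide by 1000 for seconds.
flat$departure_utc <- as.POSIXct(
  flat$dateTime / 1000,
  origin = "1970-01-01",
  tz = "UTC"
)

print(flat)
```

With the real data, `departures` would be the value returned by `map(result1$legs, ~ .x$timeline$departureTime)`.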