I am scraping the Newark Liberty International Airport website to track their flight schedule. This is the code I have written so far:
library(rvest)
# The URL must be a single unbroken string
url <- read_html('https://www.airport-ewr.com/newark-departures-terminal-C?tp=6&day=tomorrow')
population <- url %>% html_nodes(xpath = '//*[@id="flight_detail"]') %>%
  html_text() %>% gsub(pattern = '\\t|\\r|\\n', replacement = ' ') %>%
  trimws() %>% gsub(pattern = '\\s+', replacement = " ")
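The gsub() and trimws() calls replace tabs, carriage returns, and newlines with spaces, strip leading and trailing whitespace, and collapse repeated spaces. The code works well; I have attached a snippet of the output:
I would like to convert this string into a data frame containing values like the following:
Any help is appreciated!
Answer 0 (score: 3)
Try this: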
library(rvest)
url <- read_html('https://www.airport-ewr.com/newark-departures-terminal-C?tp=6&day=tomorrow')
population <- url %>% html_nodes(xpath = '//*[@id="flight_detail"]') %>%
html_text()
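First we read the raw text lines. I then noticed that the columns are separated by \n, but sometimes by more than one, so we first use gsub to collapse the extra \n separators, then split each string on \n with strsplit, and finally rbind the pieces and convert the result to a data.frame: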
# Collapse runs of "\n" into one, split each record on "\n", then bind the rows
popDF <- as.data.frame(
  do.call(rbind, strsplit(gsub("\\n+", "\n", population), split = "\n", fixed = TRUE))
)
  V1              V2                 V3       V4        V5                V6  V7       V8                   V9
1 Austin (AUS)    United Airlines    UA 2427  06:00 am  Depart: 06:00 am  C   Term. C  Scheduled - On-time  [+]
2 Austin (AUS)    SAS                SK 6868  06:00 am  Depart: 06:00 am  C   Term. C  Scheduled - On-time  [+]
3 Boston (BOS)    United Airlines    UA 1699  06:00 am  Depart: 06:00 am  C   Term. C  Scheduled - On-time  [+]
4 Columbus (CMH)  CommutAir          C5 4973  06:00 am  Depart: 06:00 am  C   Term. C  Scheduled - On-time  [+]
5 Columbus (CMH)  United Airlines    UA 4973  06:00 am  Depart: 06:00 am  C   Term. C  Scheduled - On-time  [+]
6 Detroit (DTW)   Republic Airlines  YX 3482  06:00 am  Depart: 06:00 am  C   Term. C  Scheduled - On-time  [+]
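If you want to try the collapse, split, and bind steps without hitting the site, here is a minimal offline sketch. The two sample strings are made up to imitate what html_text() might return, and the column names are my own guesses rather than anything taken from the page, so adjust both to your real data. It also trims a leading or trailing "\n" and checks that every record splits into the same number of fields, because rbind recycles short rows instead of failing.

```r
# Hypothetical sample imitating html_text() output: one string per flight,
# with fields separated by one or more newlines.
population <- c(
  "Austin (AUS)\n\nUnited Airlines\nUA 2427\n\n\n06:00 am\nDepart: 06:00 am\nC\nTerm. C\n\nScheduled - On-time\n[+]",
  "Boston (BOS)\nUnited Airlines\n\nUA 1699\n06:00 am\n\nDepart: 06:00 am\nC\nTerm. C\nScheduled - On-time\n\n[+]"
)

# Collapse runs of newlines into one, then drop any leading/trailing newline
collapsed <- gsub("\n+", "\n", population)
collapsed <- gsub("^\n|\n$", "", collapsed)

# Split each record on the single remaining newline
fields <- strsplit(collapsed, split = "\n", fixed = TRUE)

# Guard: rbind silently recycles values when rows differ in length
stopifnot(length(unique(lengths(fields))) == 1)

# Bind the records row-wise and convert the character matrix to a data frame
popDF <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)

# Assumed column names (not from the site), chosen only for readability
names(popDF) <- c("destination", "airline", "flight", "time", "departure",
                  "terminal", "terminal_label", "status", "details")

print(popDF)
```

If the stopifnot() check fails on the real page, some records have missing or extra fields, and those rows need to be inspected before binding.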