I am trying to parse this data into a meaningful format, but I can't get rid of the \t\n\t\t\t sequences. Please help.
#Loading the rvest package
library('rvest')
# Define the url once.
URL <- "https://rotogrinders.com/pages/pga-course-history-743469"
tablescrape_html <- read_html(URL)
tablescrape_html
tablescrape_html %>%
  html_nodes("table") %>%
  head()
tablescrape_html %>%
  html_nodes("tr") %>% # grab the <tr> (table row) tags
  html_text() %>% # isolate the text from the html tags
  gsub("^\\s+|\\s+$", "", .) %>% # strip the white space from the beginning and end of each string
  head(n=100) # take a peek at the first 100 records
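Answer 0 (score: 1)
Replace each separator made of tabs, line endings and spaces with a single space, then pass the result to read.table with header and fill set: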
input <- tablescrape_html %>%
html_nodes("tr") %>%
html_text() %>%
gsub("[\t\n ]+", " ", .) %>%
read.table(text=., fill=TRUE, header=TRUE)
str(input)
> str(input)
'data.frame': 133 obs. of 15 variables:
$ Golfer : Factor w/ 107 levels "Aaron","Adam",..: 40 79 71 20 2 14 69 77 100 102 ...
$ Rounds : Factor w/ 129 levels "Armour","Baddeley",..: 60 12 73 111 44 47 78 11 48 110 ...
$ Avg : Factor w/ 18 levels "0","10","11",..: 1 10 6 12 2 5 2 8 8 4 ...
$ Score : Factor w/ 67 levels "","0","10","14",..: 1 37 11 12 25 18 44 7 15 35 ...
$ Avg.1 : Factor w/ 74 levels "","13.00","14.00",..: 1 35 56 46 36 34 36 33 51 59 ...
$ Fairways: Factor w/ 104 levels "","134.15","202.68",..: 1 41 44 64 49 42 46 69 35 32 ...
$ Hit : num NA 33.5 47.8 42 37.7 ...
$ Avg.2 : num NA 28.4 27.4 26.5 28.1 ...
$ Drive : num NA NA NA NA NA NA NA NA NA NA ...
$ Yards : logi NA NA NA NA NA NA ...
$ Avg.3 : logi NA NA NA NA NA NA ...
$ Greens : logi NA NA NA NA NA NA ...
$ Hit.1 : logi NA NA NA NA NA NA ...
$ Avg.4 : logi NA NA NA NA NA NA ...
$ Putts : logi NA NA NA NA NA NA ...
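In the str() output above, the two-word golfer names and multi-word headers such as "Avg Score" end up spread over several columns, because every space is treated as a field separator. One possible way around that, as a minimal sketch: turn each run of tabs/newlines into a delimiter that does not occur in the data and split on that instead. The `rows` vector below is made-up sample data that only imitates what html_text() returns for each row (it is not taken from the page), and the "|" delimiter is an assumption that holds only if no cell contains that character.

```r
# Hypothetical sample rows, imitating html_text() output for each <tr>
rows <- c(
  "Golfer\n\t\t\tRounds\n\t\t\tAvg Score",
  "Adam Scott\n\t\t\t12\n\t\t\t69.5",
  "Jason Day\n\t\t\t8\n\t\t\t70.1"
)

# Turn every run of tabs/newlines (plus adjacent spaces) into one "|"
delimited <- gsub("[ ]*[\t\n]+[\t\n ]*", "|", rows)

# Split on "|" so that spaces inside a cell are kept;
# quote = "" stops apostrophes in names from being read as quotes
parsed <- read.table(text = delimited, sep = "|", header = TRUE,
                     fill = TRUE, quote = "", stringsAsFactors = FALSE)
str(parsed)
```

With the default check.names = TRUE, a header like "Avg Score" becomes the column name Avg.Score.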
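Answer 1 (score: 0)
You need to split each string on the \t\n sequences. To turn the result into a data frame, force all the vectors to the same length and then bind the rows into one table.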
library(tidyverse)
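tablescrape_html %>%
  html_nodes("tr") %>% # grab the <tr> (table row) tags
  html_text() %>% # isolate the text from the html tags
  gsub("^\\s+|\\s+$", "", .) %>% # strip leading and trailing white space
  str_split("\\t\\n\\t+[ \t]*") %>% # split each row into its cells
  map(`length<-`, 7) %>% # pad (or truncate) every row to 7 cells
  do.call(rbind, .) # bind the rows into one character matrix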
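The same split, pad and bind idea can be tried without the network or the tidyverse. This is only a base R sketch: the `rows` vector is invented sample data (the last row is deliberately one cell short), and using the first row as the header is an assumption about the table layout.

```r
# Hypothetical sample rows, imitating html_text() output for each <tr>
rows <- c(
  "Golfer\n\t\t\tRounds\n\t\t\tAvg Score",
  "Adam Scott\n\t\t\t12\n\t\t\t69.5",
  "Jason Day\n\t\t\t8"
)

# Split each row on runs of tabs/newlines (plus adjacent spaces)
cells <- strsplit(trimws(rows), "[ ]*[\t\n]+[\t\n ]*")

# Pad every row with NA up to the length of the longest row
width <- max(lengths(cells))
padded <- lapply(cells, `length<-`, width)

# Bind the rows into a character matrix, then use row 1 as the header
mat <- do.call(rbind, padded)
df <- as.data.frame(mat[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- mat[1, ]
str(df)
```

Taking the width from the data rather than hard-coding 7 avoids silently truncating longer rows. All columns come back as character, so numeric columns still need converting, for example with type.convert().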