How do I coerce this data into a data.frame?

Time: 2018-04-15 02:34:24

Tags: r web-scraping data-science

I am trying to parse this data into a meaningful format, but I can't get rid of the \t\n\t\t\t sequences. Please help.

# Load the rvest package
library(rvest)

# Define the URL once.
URL <- "https://rotogrinders.com/pages/pga-course-history-743469"

tablescrape_html <- read_html(URL)
tablescrape_html

tablescrape_html %>%
  html_nodes("table") %>%
  head()

tablescrape_html %>%
  html_nodes("tr") %>%              # grab the <tr> (table row) nodes
  html_text() %>%                   # isolate the text from the HTML tags
  gsub("^\\s+|\\s+$", "", .) %>%    # strip whitespace from the beginning and end of each string
  head(n = 100)                     # take a peek at the first 100 records

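2 Answers:

Answer 0: (score: 1)

Replace each separator made of tabs, newlines and spaces with a single space, then pass the result to read.table with header and fill set: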

input <- tablescrape_html %>%
  html_nodes("tr") %>%
  html_text() %>%
  gsub("[\t\n ]+", " ", .) %>%      # collapse each run of whitespace to one space
  read.table(text = ., fill = TRUE, header = TRUE)

> str(input)
'data.frame':   133 obs. of  15 variables:
 $ Golfer  : Factor w/ 107 levels "Aaron","Adam",..: 40 79 71 20 2 14 69 77 100 102 ...
 $ Rounds  : Factor w/ 129 levels "Armour","Baddeley",..: 60 12 73 111 44 47 78 11 48 110 ...
 $ Avg     : Factor w/ 18 levels "0","10","11",..: 1 10 6 12 2 5 2 8 8 4 ...
 $ Score   : Factor w/ 67 levels "","0","10","14",..: 1 37 11 12 25 18 44 7 15 35 ...
 $ Avg.1   : Factor w/ 74 levels "","13.00","14.00",..: 1 35 56 46 36 34 36 33 51 59 ...
 $ Fairways: Factor w/ 104 levels "","134.15","202.68",..: 1 41 44 64 49 42 46 69 35 32 ...
 $ Hit     : num  NA 33.5 47.8 42 37.7 ...
 $ Avg.2   : num  NA 28.4 27.4 26.5 28.1 ...
 $ Drive   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Yards   : logi  NA NA NA NA NA NA ...
 $ Avg.3   : logi  NA NA NA NA NA NA ...
 $ Greens  : logi  NA NA NA NA NA NA ...
 $ Hit.1   : logi  NA NA NA NA NA NA ...
 $ Avg.4   : logi  NA NA NA NA NA NA ...
 $ Putts   : logi  NA NA NA NA NA NA ...
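
The column names in the str() output above hint at a side effect of collapsing every whitespace run to a single space: multi-word headers appear to be split across columns (Avg, Score, Avg.1, ...), and the Golfer and Rounds columns both hold name-like levels. One possible way around this, sketched here on made-up rows rather than the scraped page, is to turn only the separator runs (those containing a newline) into a tab and read the result with sep = "\t". The sample strings and column names below are assumptions for illustration only.

```r
# Hypothetical row text, shaped like html_text() output for <tr> nodes:
# cells are separated by the "\t\n\t\t\t" sequence mentioned in the question.
rows <- c(
  "Golfer\t\n\t\t\tRounds\t\n\t\t\tAvg Score",
  "Adam Scott\t\n\t\t\t12\t\n\t\t\t70.5",
  "Jason Day\t\n\t\t\t8\t\n\t\t\t69.8"
)

# Replace each separator run (whitespace that includes a newline) with one tab.
# Spaces inside a cell, as in "Avg Score", are left alone.
cleaned <- gsub("[ \t]*\n[ \t\n]*", "\t", rows)

# Read the tab-delimited lines; check.names = FALSE keeps "Avg Score" intact.
df <- read.table(text = cleaned, sep = "\t", header = TRUE,
                 fill = TRUE, check.names = FALSE,
                 stringsAsFactors = FALSE)
str(df)
```

Whether this works on the real page depends on the cells actually being separated by a newline, which is an assumption based on the sequence quoted in the question.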

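Answer 1: (score: 0)

You need to split each string on the \t\n separators. To turn the result into a data frame, force all of the resulting vectors to the same length and then bind the rows into one table.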

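library(tidyverse)
tablescrape_html %>%
  html_nodes("tr") %>%                 # grab the <tr> (table row) nodes
  html_text() %>%                      # isolate the text from the HTML tags
  gsub("^\\s+|\\s+$", "", .) %>%       # trim leading and trailing whitespace
  str_split("\\t\\n\\t+[ \t]*") %>%    # split each row into cells on the separator
  map(`length<-`, 7) %>%               # pad or truncate every row to 7 cells
  do.call(rbind, .)                    # bind the rows into one character matrix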
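
Note that do.call(rbind, .) returns a character matrix, not a data.frame. The following is a base-R sketch of the same split, pad and bind steps on made-up rows, with one extra conversion step at the end. The sample text, the choice to treat the first row as the header, and the use of type.convert are assumptions for illustration, not part of the original answer.

```r
# Hypothetical, already-trimmed row text; the last row is missing a cell.
rows <- c(
  "Golfer\t\n\t\t\tRounds\t\n\t\t\tAvg Score",
  "Adam Scott\t\n\t\t\t12\t\n\t\t\t70.5",
  "Jason Day\t\n\t\t\t8"
)

# Split each row into cells on the separator runs.
cells <- strsplit(rows, "\t\n\t+[ \t]*")

# Pad every row to the same number of cells (missing cells become NA).
n_col <- max(lengths(cells))
padded <- lapply(cells, `length<-`, n_col)

# Bind the rows into a character matrix, then promote the first row to a header.
m <- do.call(rbind, padded)
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]

# Convert columns that look numeric; text columns stay as character.
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

Computing n_col from the data avoids hard-coding the column count, at the cost of silently accepting rows that are longer than expected.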