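Cleaning data scraped from the web

Time: 2018-01-22 04:20:52

Tags: r web-scraping rvest

I'm fairly new to this. I've been working on a project (just for fun) to help me learn, and I've run into something I can't seem to find an answer to online. I'm trying to teach myself to scrape data from websites, and I started with the code below to retrieve some data from 247Sports.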

library(rvest)
library(stringr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)
data <- 
  html_nodes(x   = link.scrap, 
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>% 
  trimws()

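When I look at data, it appears to be a character vector of length 1, with multiple list items stored as a single value. The problem I'm having is splitting them into their own columns. For example, when I run the code below, which I thought would split the data at ")" and then strip the whitespace from the two resulting values, I get an odd result.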

f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima  El Camino College (Torrance, CA\", \"         DT 6-3 310    0.8681      39 4 9       Enrolled   1/9/2017\")"

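I've tried a few other things without success. So I guess my question is: what is the best way to take the data from this HTML list and convert it into a format where each data point has its own column (i.e. name, college, position, stats, etc.)?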

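2 answers:

Answer 0: (score: 2)

I changed a few things in the code.

  • Use more general CSS selectors so that every row is extracted, not just one.

  • Collect each column as a vector, then build the data frame.

Please check: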

library(rvest)
library(stringr)
library(tidyr)

link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"

link.scrap <- read_html(link)

names <- link.scrap %>% html_nodes('div.name') %>% html_text()

pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text() 

status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text() 

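# Combine the three character vectors into one data frame
data <- data.frame(names, pos, status, stringsAsFactors = FALSE)

# Drop the first row, which holds the list's header labels rather than a player
data <- data[-1, ]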

head(data)


> head(data)
                                                      names          pos                     status
2        Kamilo Tongamoa  Merced College (Merced, CA)        DT 6-5 320     Enrolled   8/24/2017   
3        Ray Lima  El Camino College (Torrance, CA)          DT 6-3 310      Enrolled   1/9/2017   
4  O'Rien Vance  George Washington (Cedar Rapids, IA)       OLB 6-3 235     Enrolled   6/12/2017   
5          Matt Leo  Arizona Western College (Yuma, AZ)     WDE 6-7 265     Enrolled   2/22/2017   
6            Keontae Jones  Colerain (Cincinnati, OH)         S 6-1 175     Enrolled   6/12/2017   
7      Cordarrius Bailey  Clarksdale (Clarksdale, MS)       WDE 6-4 210     Enrolled   6/12/2017   
> 
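The columns above still hold several values each. As a rough sketch of one way to separate them, the base R code below splits each string on whitespace. It assumes that name and school are separated by a run of two or more whitespace characters, as they appear in the printed output; the sample values are copied from that output and `pick()` is a hypothetical helper, not part of the answer. Note that `strsplit()` returns a list with one element per input string, which is why calling `str_trim()` directly on its result in the question printed a deparsed `c(...)` string.

```r
# Sample values copied from the head(data) output; the live page may differ.
data <- data.frame(
  names  = c("Kamilo Tongamoa  Merced College (Merced, CA)",
             "Ray Lima  El Camino College (Torrance, CA)"),
  pos    = c(" DT 6-5 320 ", " DT 6-3 310 "),
  status = c(" Enrolled   8/24/2017 ", " Enrolled   1/9/2017 "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: take the i-th piece from each element of a strsplit() result.
pick <- function(parts, i) vapply(parts, `[`, character(1), i)

# Assumption: two or more whitespace characters separate name and school.
name_parts   <- strsplit(trimws(data$names), "\\s{2,}")
metric_parts <- strsplit(trimws(data$pos), "\\s+")
status_parts <- strsplit(trimws(data$status), "\\s+")

tidy <- data.frame(
  name    = pick(name_parts, 1),
  college = pick(name_parts, 2),
  pos     = pick(metric_parts, 1),
  ht      = pick(metric_parts, 2),
  wt      = pick(metric_parts, 3),
  status  = pick(status_parts, 1),
  date    = pick(status_parts, 2),
  stringsAsFactors = FALSE
)

print(tidy)
```

If a row has fewer pieces than expected, `pick()` yields `NA` for the missing position instead of failing, so it is worth checking the result for `NA` values.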

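Answer 1: (score: 0)

The basic problem is that the web page contains something that looks like a table but is actually a heavily styled list. This means you need to process each element, pull out the relevant nodes, and further process the node contents as required.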
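To illustrate the idea of handling one list item at a time, here is a minimal sketch on a small inline HTML string. The markup is hypothetical: it only mimics class names used in this answer (`a.name`, `span.meta`, `span.score`), and the second item deliberately has no score element to show what happens when a node is absent. Calling `html_node()` (singular) on a set of items returns exactly one result per item, so the columns stay aligned.

```r
library(rvest)

# Hypothetical markup; the real page structure is more deeply nested.
html <- '<ul class="content-list ri-list">
  <li><a class="name">Ray Lima</a><span class="meta"> El Camino College (Torrance, CA) </span><span class="score">0.8681</span></li>
  <li><a class="name">Matt Leo</a><span class="meta"> Arizona Western College (Yuma, AZ) </span></li>
</ul>'

# One node per list item
items <- read_html(html) %>% html_nodes("ul.content-list li")

# html_node() yields one match (or a missing node) per item,
# so a missing element becomes NA instead of shifting later rows up.
players <- data.frame(
  name    = items %>% html_node("a.name") %>% html_text(trim = TRUE),
  college = items %>% html_node("span.meta") %>% html_text(trim = TRUE),
  score   = items %>% html_node("span.score") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

print(players)
```

The trade-off is that this needs a selector that matches exactly one list item per player, which has to be checked against the real page.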

First, grab the whole list:

library(dplyr)
library(rvest)

iowa_state <- read_html("https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank") %>%
  html_nodes('ul.content-list.ri-list')

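Extract the metrics (position, height, weight). This creates a vector in which the first 3 elements are the headers (Pos, Ht, Wt); each player's metrics then fill the remaining elements, three at a time.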

metrics <- iowa_state %>% 
  html_nodes("ul.metrics-list li") %>% 
  html_text() %>% 
  trimws()
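Because the vector is laid out as three headers followed by repeating triples, it can also be reshaped in one step instead of indexing by position. The sketch below is an illustration on a hand-written vector (values copied from the output shown in this answer), not a replacement for the answer's method; it assumes the vector length is an exact multiple of 3.

```r
# Header labels followed by one (pos, ht, wt) triple per player.
metrics <- c("Pos", "Ht", "Wt",
             "DT",  "6-5", "320",
             "DT",  "6-3", "310",
             "OLB", "6-3", "235")

# Guard against a partial row, which would silently misalign the columns.
stopifnot(length(metrics) %% 3 == 0)

# Drop the three header labels, then fill a 3-column matrix row by row.
m <- matrix(metrics[-(1:3)], ncol = 3, byrow = TRUE,
            dimnames = list(NULL, tolower(metrics[1:3])))

metrics_df <- as.data.frame(m, stringsAsFactors = FALSE)
print(metrics_df)
```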

Extract the status ("Enrolled" and the date). This creates a vector in which "Enrolled" fills elements 1, 3, 5, ... and the dates fill elements 2, 4, 6, ...

status <- iowa_state %>% 
  html_nodes("p.commit-date") %>% 
  html_text() %>% 
  trimws()
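An interleaved vector like this can be separated with a recycled logical index, which may be easier to read than computing positions. This is a small sketch on a hand-written vector with values copied from the output in this answer; it assumes the vector has an even length, so every status has a matching date.

```r
# Status label and date alternate: label, date, label, date, ...
status <- c("Enrolled", "8/24/2017",
            "Enrolled", "1/9/2017",
            "Enrolled", "6/12/2017")

# An odd length would mean a label or a date is missing somewhere.
stopifnot(length(status) %% 2 == 0)

# c(TRUE, FALSE) is recycled along the vector and keeps elements 1, 3, 5, ...
enrolled <- status[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps elements 2, 4, 6, ...
date <- status[c(FALSE, TRUE)]

print(data.frame(enrolled, date, stringsAsFactors = FALSE))
```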

Now we can build the data frame (or tibble) column by column:

iowa_state_df <- tibble(name     = iowa_state %>% html_nodes("a.name") %>% html_text(),
                        college  = iowa_state %>% html_nodes("span.meta") %>% html_text() %>% trimws(),
                        pos      = metrics[seq(4, length(metrics)-2, 3)],
                        ht       = metrics[seq(5, length(metrics)-1, 3)],
                        wt       = metrics[seq(6, length(metrics), 3)],
                        score    = iowa_state %>% html_nodes("span.score") %>% html_text(),
                        natrank  = iowa_state %>% html_nodes("div.rank a.natrank") %>% html_text(),
                        posrank  = iowa_state %>% html_nodes("div.rank a.posrank") %>% html_text(),
                        sttrank  = iowa_state %>% html_nodes("div.rank a.sttrank") %>% html_text(),
                        enrolled = status[seq(1, length(status)-1, 2)],
                        date     = status[seq(2, length(status), 2)]
)

glimpse(iowa_state_df)

Observations: 26
Variables: 11
$ name     <chr> "Kamilo Tongamoa", "Ray Lima", "O'Rien Vance", "Matt Leo", "Keontae Jones", "Cordarriu...
$ college  <chr> "Merced College (Merced, CA)", "El Camino College (Torrance, CA)", "George Washington ...
$ pos      <chr> "DT", "DT", "OLB", "WDE", "S", "WDE", "WR", "CB", "CB", "DUAL", "SDE", "OT", "OT", "WR...
$ ht       <chr> "6-5", "6-3", "6-3", "6-7", "6-1", "6-4", "5-11", "6-1", "6-0.5", "6-4", "6-3", "6-5",...
$ wt       <chr> "320", "310", "235", "265", "175", "210", "170", "190", "170", "221", "250", "260", "3...
$ score    <chr> "0.8742", "0.8681", "0.8681", "0.8656", "0.8624", "0.8546", "0.8515", "0.8482", "0.847...
$ natrank  <chr> "28", "39", "508", "48", "587", "724", "806", "885", "924", "928", "929", "NA", "NA", ...
$ posrank  <chr> "3", "4", "29", "5", "42", "42", "117", "91", "100", "19", "42", "88", "90", "12", "57...
$ sttrank  <chr> "5", "9", "4", "7", "25", "13", "9", "124", "20", "8", "6", "10", "24", "37", "20", "1...
$ enrolled <chr> "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "E...
$ date     <chr> "8/24/2017", "1/9/2017", "6/12/2017", "2/22/2017", "6/12/2017", "6/12/2017", "6/12/201...

You can then convert the columns to appropriate types (dates, numerics, etc.) as required.
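As a sketch of what that conversion might look like, the base R code below converts a few standalone vectors whose values are copied from the output above (they are not aligned rows of one player each). It assumes heights are written as feet-inches, dates as month/day/year, and that an unranked player appears as the literal string "NA"; the variable names are illustrative.

```r
ht      <- c("6-5", "5-11", "6-0.5")
wt      <- c("320", "310", "235")
score   <- c("0.8742", "0.8681", "0.8656")
natrank <- c("28", "39", "NA")
date    <- c("8/24/2017", "1/9/2017", "6/12/2017")

# Weight and score are plain numbers stored as text.
wt_lb     <- as.integer(wt)
score_num <- as.numeric(score)

# Turn the literal string "NA" into a real missing value before converting.
natrank[natrank == "NA"] <- NA
natrank_int <- as.integer(natrank)

# Assumption: height is "feet-inches"; convert to total inches.
ht_parts <- strsplit(ht, "-", fixed = TRUE)
ht_in <- vapply(ht_parts,
                function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]),
                numeric(1))

# Assumption: dates are month/day/year.
date_enrolled <- as.Date(date, format = "%m/%d/%Y")

str(list(wt_lb = wt_lb, score = score_num, natrank = natrank_int,
         ht_in = ht_in, date = date_enrolled))
```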