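I'm fairly new to this. I've been working on a project (just for fun) to help me learn, and I've run into something I can't seem to find an answer to online. I'm trying to teach myself to scrape data from websites, and I've started with the code below to retrieve some data from 247Sports.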
library(rvest)
library(stringr)
link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"
link.scrap <- read_html(link)
data <-
  html_nodes(x = link.scrap,
             css = '#page-content > div.main-div.clearfix > section.list-page > section > section > ul.content-list.ri-list > li:nth-child(3)') %>%
  html_text(trim = TRUE) %>%
  trimws()
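When I look at data, it appears to be a vector of length 1, with the multiple list items stored as a single value. The problem I'm having is splitting them into their own columns. For example, when I run the code below, which I thought would split the data at ")" and then trim the whitespace from the two resulting values, I get a strange result.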
f<-strsplit(data,")")
str_trim(f)
[1] "c(\"Ray Lima El Camino College (Torrance, CA\", \" DT 6-3 310 0.8681 39 4 9 Enrolled 1/9/2017\")"
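I've tried a few other things without success. So I guess my question is: what is the best way to take the data from this HTML list and turn it into a format where each data point has its own column (i.e. name, college, position, stats, etc.)?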
Answer 0 (score: 2)
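I changed a few things in your code.
Use more general CSS selectors, so that the whole set of rows can be extracted.
Collect each column as a vector, then build the data frame.
Please check: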
library(rvest)
library(stringr)
library(tidyr)
link <- "https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank"
link.scrap <- read_html(link)
names <- link.scrap %>% html_nodes('div.name') %>% html_text()
pos <- link.scrap %>% html_nodes('ul.metrics-list') %>% html_text()
status <- link.scrap %>% html_nodes('div.right-content.right') %>% html_text()
data <- data.frame(names, pos, status, stringsAsFactors = FALSE)
# Drop the first row, which holds the list's header labels rather than a player
data <- data[-1, ]
head(data)
> head(data)
names pos status
2 Kamilo Tongamoa Merced College (Merced, CA) DT 6-5 320 Enrolled 8/24/2017
3 Ray Lima El Camino College (Torrance, CA) DT 6-3 310 Enrolled 1/9/2017
4 O'Rien Vance George Washington (Cedar Rapids, IA) OLB 6-3 235 Enrolled 6/12/2017
5 Matt Leo Arizona Western College (Yuma, AZ) WDE 6-7 265 Enrolled 2/22/2017
6 Keontae Jones Colerain (Cincinnati, OH) S 6-1 175 Enrolled 6/12/2017
7 Cordarrius Bailey Clarksdale (Clarksdale, MS) WDE 6-4 210 Enrolled 6/12/2017
>
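As for the strange result in the question: it most likely comes from the fact that `strsplit()` returns a list (one element per input string), and `str_trim()` then coerces that whole list into a single character string. A minimal sketch of the fix, using a hand-typed stand-in for the scraped value (the real string may contain different whitespace):

```r
# Stand-in for the scraped value; the real string's whitespace may differ.
data <- "Ray Lima El Camino College (Torrance, CA) DT 6-3 310 0.8681 39 4 9 Enrolled 1/9/2017"

# strsplit() returns a list with one element per input string.
# fixed = TRUE treats ")" as a literal character rather than a regex.
f <- strsplit(data, ")", fixed = TRUE)

# Take the first list element to get the character vector of pieces,
# then trim each piece (stringr::str_trim() could be used the same way).
parts <- trimws(f[[1]])
print(parts)
```

Using `f[[1]]` (or `unlist(f)`) gives a plain character vector of length 2, so the trimming is applied to each piece separately.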
Answer 1 (score: 0)
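The basic problem is that the web page contains something that looks like a table but is actually a heavily styled list. This means you need to process each element, pulling out the relevant nodes and further processing the node contents as needed.
First, grab the whole list: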
library(dplyr)
library(rvest)
iowa_state <- read_html("https://247sports.com/college/iowa-state/Season/2017-Football/Commits?sortby=rank") %>%
  html_nodes('ul.content-list.ri-list')
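Extract the metrics (position, height, weight). This creates a vector in which the first 3 elements are the headers (Pos, Ht, Wt), and each player's metrics then fill the following elements three at a time.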
metrics <- iowa_state %>%
  html_nodes("ul.metrics-list li") %>%
  html_text() %>%
  trimws()
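Given that layout, the three columns could also be obtained in one step by dropping the three header entries and reshaping the remainder row by row. A small sketch with a hand-typed `metrics` vector (only the first two players' values; the real vector is longer, and `metrics_mat` and `metrics_df` are names introduced here for illustration):

```r
# Hand-typed stand-in: 3 header entries, then 3 values per player.
metrics <- c("Pos", "Ht", "Wt",
             "DT", "6-5", "320",
             "DT", "6-3", "310")

# Drop the header entries and fill a 3-column matrix row by row,
# so each row corresponds to one player.
metrics_mat <- matrix(metrics[-(1:3)], ncol = 3, byrow = TRUE,
                      dimnames = list(NULL, c("pos", "ht", "wt")))

metrics_df <- as.data.frame(metrics_mat, stringsAsFactors = FALSE)
print(metrics_df)
```

This assumes the number of values after the headers is a multiple of 3; `matrix()` warns if it is not, which can be a useful early sign that a player is missing a metric.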
Extract the status ("Enrolled" and the date). This creates a vector in which "Enrolled" fills elements 1, 3, 5, ... and the dates fill elements 2, 4, 6, ...
status <- iowa_state %>%
  html_nodes("p.commit-date") %>%
  html_text() %>%
  trimws()
Now we can build the data frame (or tibble) column by column:
iowa_state_df <- tibble(
  name     = iowa_state %>% html_nodes("a.name") %>% html_text(),
  college  = iowa_state %>% html_nodes("span.meta") %>% html_text() %>% trimws(),
  pos      = metrics[seq(4, length(metrics)-2, 3)],
  ht       = metrics[seq(5, length(metrics)-1, 3)],
  wt       = metrics[seq(6, length(metrics), 3)],
  score    = iowa_state %>% html_nodes("span.score") %>% html_text(),
  natrank  = iowa_state %>% html_nodes("div.rank a.natrank") %>% html_text(),
  posrank  = iowa_state %>% html_nodes("div.rank a.posrank") %>% html_text(),
  sttrank  = iowa_state %>% html_nodes("div.rank a.sttrank") %>% html_text(),
  enrolled = status[seq(1, length(status)-1, 2)],
  date     = status[seq(2, length(status), 2)]
)
glimpse(iowa_state_df)
Observations: 26
Variables: 11
$ name <chr> "Kamilo Tongamoa", "Ray Lima", "O'Rien Vance", "Matt Leo", "Keontae Jones", "Cordarriu...
$ college <chr> "Merced College (Merced, CA)", "El Camino College (Torrance, CA)", "George Washington ...
$ pos <chr> "DT", "DT", "OLB", "WDE", "S", "WDE", "WR", "CB", "CB", "DUAL", "SDE", "OT", "OT", "WR...
$ ht <chr> "6-5", "6-3", "6-3", "6-7", "6-1", "6-4", "5-11", "6-1", "6-0.5", "6-4", "6-3", "6-5",...
$ wt <chr> "320", "310", "235", "265", "175", "210", "170", "190", "170", "221", "250", "260", "3...
$ score <chr> "0.8742", "0.8681", "0.8681", "0.8656", "0.8624", "0.8546", "0.8515", "0.8482", "0.847...
$ natrank <chr> "28", "39", "508", "48", "587", "724", "806", "885", "924", "928", "929", "NA", "NA", ...
$ posrank <chr> "3", "4", "29", "5", "42", "42", "117", "91", "100", "19", "42", "88", "90", "12", "57...
$ sttrank <chr> "5", "9", "4", "7", "25", "13", "9", "124", "20", "8", "6", "10", "24", "37", "20", "1...
$ enrolled <chr> "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "Enrolled", "E...
$ date <chr> "8/24/2017", "1/9/2017", "6/12/2017", "2/22/2017", "6/12/2017", "6/12/2017", "6/12/201...
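Building columns from separate page-wide selectors assumes that every selector matches exactly once per player; if one player lacked, say, a score, the vectors would end up with different lengths. A more defensive variant is sketched below on a made-up HTML snippet (the markup, the player names and the second player's missing score are invented for illustration, not taken from the page): select the `li` items first, then use `html_node()` (singular), which returns one result per item and yields `NA` where nothing matches.

```r
library(rvest)

# Invented markup that mimics a styled list; not the real page structure.
page <- read_html('<ul class="ri-list">
  <li><a class="name">Player A</a><span class="score">0.8742</span></li>
  <li><a class="name">Player B</a></li>
</ul>')

# One node per list item.
items <- page %>% html_nodes("ul.ri-list > li")

# html_node() returns exactly one result per item, so the columns stay
# aligned; items without a match give NA after html_text().
players <- data.frame(
  name  = items %>% html_node("a.name") %>% html_text(trim = TRUE),
  score = items %>% html_node("span.score") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)
print(players)
```

The trade-off is that the selectors must be written relative to each item, but a missing value then shows up as `NA` in the right row instead of shifting every later value up by one.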
You can then convert the columns to the appropriate types (date, numeric, etc.) as needed.
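A minimal sketch of such conversions in base R, using a small hand-typed data frame with the first three players' values from the output above (the `players` name and the derived `ht_in` column, height in inches, are introduced here for illustration):

```r
# Hand-typed stand-in for a few character columns of the scraped data.
players <- data.frame(
  ht      = c("6-5", "6-3", "6-3"),
  wt      = c("320", "310", "235"),
  score   = c("0.8742", "0.8681", "0.8681"),
  natrank = c("28", "39", "508"),
  date    = c("8/24/2017", "1/9/2017", "6/12/2017"),
  stringsAsFactors = FALSE
)

# Numeric columns.
players$wt      <- as.integer(players$wt)
players$score   <- as.numeric(players$score)
players$natrank <- as.integer(players$natrank)

# Dates are written month/day/year on the page.
players$date <- as.Date(players$date, format = "%m/%d/%Y")

# Heights look like "feet-inches" (e.g. "6-0.5"); convert to total inches.
ht_parts <- strsplit(players$ht, "-", fixed = TRUE)
players$ht_in <- sapply(ht_parts, function(p) 12 * as.numeric(p[1]) + as.numeric(p[2]))

str(players)
```

Rank columns that contain the string "NA" become missing values when converted with `as.integer()`.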