Question

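I am trying to scrape fantasy football player salary data from rotoguru1.com. An example of the pages I want to collect data from is here: http://rotoguru1.com/cgi-bin/fyday.pl?week=1&year=2014&game=dk&scsv=1. Conveniently, the data appears in scsv (semicolon-separated) format under an html "pre" tag on each page. I first use a for loop to generate all of the urls I want to scrape, but I am then struggling to get the data from those pages into the format I want, namely a final data table containing all of the scraped data. I use a second for loop to iterate over the urls, calling read_html() on each page and then html_nodes('pre')%>%html_text() to extract the data of interest. The problem is that, as my code currently works, this just creates one large object per page that holds the entire scsv as a single element, rather than a table with separate columns (week, year, gid, name, position, team, h/a, opt, dk points, dk salary). What I want is a data table with these separate columns for all of the pages I scrape, but I don't have much experience with web scraping and am not sure how to fix this. Any help would be greatly appreciated. Below is the code I have written so far: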

library(purrr) 
library(rvest)
library(data.table)
library(stringr)
library(tidyr)


#Declare variables and empty data tables
path1<-("http://rotoguru1.com/cgi-bin/fyday.pl?week=")
seasons<-c("2014", "2015", "2016","2017","2018","2019","2020")
weeks<-1:17
result<-NULL
temp<-NULL

#Use nested for loops to get the url, season, and week for each webpage of interest, store in result data table
for(s in 1:length(seasons)){
  for(w in 1:length(weeks)){
    temp<- paste0(path1, as.character(w),"&year=",seasons[s],"&game=dk&scsv=1")
    result<-rbind(result,temp)
  }
}

#Get rid of any potential empty values from result
result<-compact(result) 

final<-data.table()
#Create final data table with all injury information
for (i in 1:length(result)){
  page<-read_html(result[i])
  data<-page%>%html_nodes("pre")%>%html_text()
  final<-rbind(data,final)
  
}

Answer 1

I believe the entire code from the first for-loop onward can be replaced with the following (mostly data.table) solution:

result <- CJ(seasons, weeks)[, paste0(path1, weeks, "&year=", seasons, "&game=dk&scsv=1") ]
#loop over result
final <- data.table::rbindlist(
  lapply( result, function(x) {
    read_html(x) %>%
      html_nodes("pre") %>% 
      html_text() %>%
      data.table::fread( sep = ";" ) # <-- !!
    } ),
  use.names = TRUE, fill = TRUE )
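
To see why the fread( sep = ";" ) step is the key part, here is a minimal offline sketch. It assumes each page's "pre" text looks roughly like the two hypothetical strings below (the column names follow the question; the rows are invented for illustration and are not real site data). It uses fread's text argument rather than piping the string in as the first argument, which should be equivalent for text that contains newlines.

```r
library(data.table)

# Hypothetical stand-ins for the html_text() output of two pages.
# Column names follow the question; the rows are made up.
page1 <- "week;year;gid;name;position;team;h/a;opt;dk points;dk salary
1;2014;1001;Player, A;QB;aaa;h;bbb;20.5;7000
1;2014;1002;Player, B;RB;bbb;a;aaa;11.2;5500"

page2 <- "week;year;gid;name;position;team;h/a;opt;dk points;dk salary
2;2014;1001;Player, A;QB;aaa;a;ccc;17.1;7200"

# Parse each semicolon-separated string into its own data.table
parsed <- lapply(c(page1, page2), function(txt) fread(text = txt, sep = ";"))

# Stack the per-page tables; fill = TRUE tolerates pages with missing columns
combined <- rbindlist(parsed, use.names = TRUE, fill = TRUE)
print(combined)
```

Each page becomes a table with one column per field, and rbindlist() stacks them by column name, which is the shape the question asks for.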

Answer 2

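The page has an option to return the data as an html table, so you can use "&game=dk" instead of "&game=dk&scsv=1" in your loop and then just use html_table.

Here is an example for one page (with result rebuilt using the shorter "&game=dk" suffix):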

page<-read_html(result[1])

x<-data.frame(page%>%html_nodes("table")  %>%  `[`(9) %>% html_table(T))
colnames(x)  <- as.character(x[1,])
x <- x[-1,]
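
The header-promotion step above can be tried offline on a small table. This is only a sketch: the HTML string is a hypothetical stand-in for a downloaded page, so it has a single table and does not need the `[`(9) selection, which depends on the real page layout. The final type.convert() call is an optional extra that is not part of the answer.

```r
library(rvest)

# Hypothetical HTML standing in for one downloaded page (invented rows)
html <- "<html><body><table>
<tr><td>week</td><td>name</td><td>dk salary</td></tr>
<tr><td>1</td><td>Player A</td><td>7000</td></tr>
<tr><td>1</td><td>Player B</td><td>5500</td></tr>
</table></body></html>"

page <- read_html(html)

# html_table() on a node set returns a list with one element per table
x <- as.data.frame(html_table(html_nodes(page, "table"))[[1]])

# The header cells are plain <td> cells, so they arrive as the first data row;
# promote that row to column names and then drop it
colnames(x) <- as.character(unlist(x[1, ]))
x <- x[-1, ]

# Optional: columns are still character after promotion, so re-guess their types
x[] <- lapply(x, type.convert, as.is = TRUE)
print(x)
```

Because the header row is read as data, every column starts out as character, which is why a type conversion afterwards may be worth adding.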

Struggling to convert scraped data in a single column into proper table format
