I am using the following rvest code:
library(rvest)
URL <- "http://www.soccerstats.com/matches.asp" #Feed page
WS <- read_html (URL) #reads webpage into WS variable
URLs <- WS %>% html_nodes ("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs
URLs <- paste0("http://www.soccerstats.com/",URLs)
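oversdf <- data.frame(URLs=URLs) # wrap the URLs in a data frame (this line was missing from the snippet; see the complete code in the second answer)
grepl("pmatch", oversdf$URLs)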
URLs <-subset(oversdf, grepl("pmatch", oversdf$URLs),stringsAsFactors = FALSE)
Catcher1 <- data.frame(FMatch=character(),TotalGoals=character (),stringsAsFactors = FALSE)
#Start of for loop
for (i in URLs) {
WS1 <- read_html(i)
FMatch <- WS1 %>% html_nodes("H1") %>% html_text() %>% as.character()
TotalGoals <- WS1 %>% html_nodes(".trow3+ .trow2 td~ td+ td font b") %>% html_text() %>% as.character()
temp <- data.frame(FMatch,TotalGoals)
Catcher1 <- rbind(Catcher1,temp)
cat("*")
}
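When it tries to run the loop, I get this error:
Error in UseMethod("read_xml") : no applicable method for 'read_xml' applied to an object of class "factor"
From the forum posts I have read, I need to use stringsAsFactors = FALSE, because my data frame stores the field data as factors rather than strings.
The only place left that I can think of is the temp data frame:
temp <- data.frame(FMatch,TotalGoals)
But when I tried applying it to that data frame I got a syntax error. Any ideas?
(Obviously I'm a novice, so I may be wrong about what is causing the error; this is just what the various forum posts I have read seem to suggest.)
Cheers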
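Answer 0 (score: 3)
Basically the problem is in setting up the HTML environment on each pass through the loop. To get around this, I call html_session() at the start of each iteration and feed the result to html_nodes():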
for (i in URLs) {
WS1 <- html_session(i)
FMatch <- html_nodes(WS1, "h1") %>% html_text() %>% as.character()
TotalGoals <- html_nodes(WS1, ".trow3+ .trow2 td~ td+ td font b") %>% html_text() %>% as.character()
temp <- data.frame(FMatch,TotalGoals)
Catcher1 <- rbind(Catcher1,temp)
cat("*")
}
This returns:
R> Catcher1
FMatch TotalGoals
1 Santa Cruz vs Criciuma 2.07
2 FC Kiffen vs FC KTP 3.08
3 Furth B vs Augsburg B 2.00
4 Vikingur R. vs IBV 3.67
5 IA Akranes vs KR Reykjavik 4.00
6 Hafnarfjordur vs Valur 2.84
7 Valerenga B vs Skeid 2.25
8 Constanta vs Voluntari 1.75
9 Syrianska vs Norrby 3.46
10 Osters IF vs GAIS 3.14
11 Sleipner vs Linkoping City 2.94
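One thing the loop above leaves open is what happens when a page has no matching node: data.frame(FMatch, TotalGoals) stops with an error if one vector is empty and the other is not. Here is a minimal sketch of one way to guard against that, using base R only. The helper name make_row and the placeholder strings are my own assumptions, not part of the original code, and keeping only the first match is a design choice you may not want.

```r
# Hypothetical helper: build a one-row data frame even when a selector
# matched nothing, falling back to NA for the missing field.
make_row <- function(FMatch, TotalGoals) {
  data.frame(
    FMatch     = if (length(FMatch)) FMatch[1] else NA_character_,
    TotalGoals = if (length(TotalGoals)) TotalGoals[1] else NA_character_,
    stringsAsFactors = FALSE
  )
}

# Placeholder inputs standing in for html_text() results (made-up values).
rows <- list(
  make_row("Team A vs Team B", "2.00"),
  make_row("Team C vs Team D", character(0))  # selector matched nothing
)

# Bind all rows once at the end instead of growing Catcher1 inside the loop.
Catcher1 <- do.call(rbind, rows)
```

Inside the real loop you would call make_row(FMatch, TotalGoals) in place of data.frame(FMatch, TotalGoals) and collect the results in a list.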
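Answer 1 (score: 0)
Actually, I found that both approaches work once a few adjustments are made. The first problem is that the initial code specifies "H1" [html_nodes("H1")] rather than the correct "h1", so that has to be fixed.
The second problem is that "URLs" has the wrong type. Namely, it has the following properties: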
> typeof(URLs)
[1] "list"
> length(URLs)
[1] 1
Meanwhile, I would expect it to be:
> typeof(URLs)
[1] "character"
> length(URLs)
[1] 10 #or some other number
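To show where the "list" comes from, here is a small self-contained sketch that needs no network access. The three links are made-up stand-ins for the scraped ones, and stringsAsFactors = TRUE is set explicitly as an assumption to mimic the data.frame() default before R 4.0.0. As far as I can tell, subset() simply ignores the extra stringsAsFactors argument, so the column stays a factor.

```r
# Hypothetical links standing in for the scraped URLs.
links <- c("http://www.soccerstats.com/pmatch.asp?x=1",
           "http://www.soccerstats.com/latest.asp",
           "http://www.soccerstats.com/pmatch.asp?x=2")

# Assumption: mimic the pre-4.0.0 default, where strings become factors.
oversdf <- data.frame(URLs = links, stringsAsFactors = TRUE)

# subset() returns a data frame; stringsAsFactors here has no effect.
URLs <- subset(oversdf, grepl("pmatch", oversdf$URLs), stringsAsFactors = FALSE)

typeof(URLs)       # "list": a data frame is a list of columns
length(URLs)       # 1: one column, not one entry per URL
class(URLs$URLs)   # "factor"

# A for loop over a data frame visits columns, so the body runs once
# and receives the whole factor column rather than a single URL string.
seen <- list()
for (i in URLs) seen[[length(seen) + 1]] <- i
```

That single factor column is what read_html() was handed in the original loop, which matches the "applied to an object of class factor" error.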
So I used the following workaround:
n<-nrow(URLs)
URLs2<-character()
for (i in 1:n) {
URLs2[i]<-as.character(URLs[i,1])
}
This makes the initial version run correctly. Here is the complete code (thanks for a nice example, by the way):
library(rvest)
URL <- "http://www.soccerstats.com/matches.asp" #Feed page
WS <- read_html (URL) #reads webpage into WS variable
URLs <- WS %>% html_nodes ("a:nth-child(1)") %>% html_attr("href") %>% as.character() # Get the CSS nodes & extract the URLs
URLs <- paste0("http://www.soccerstats.com/",URLs)
oversdf <- data.frame(URLs=URLs)
rownames(oversdf) #returns a vector of row names in the overs data.frame:
URLs <-subset(oversdf, grepl("pmatch", oversdf$URLs),stringsAsFactors = FALSE)
write.csv(URLs,file=paste(getwd(),"/sportURLs.csv",sep=""),row.names=FALSE)
Catcher1 <- data.frame(FMatch=character(),TotalGoals=character (),stringsAsFactors = FALSE)
##################################
#start of workaround
n<-nrow(URLs)
URLs2<-character()
for (i in 1:n) {
URLs2[i]<-as.character(URLs[i,1])
}
#Start of for loop
for (i in URLs2) {
#end of workaround
#######################################
WS1 <- read_html(i)
FMatch <- WS1 %>% html_nodes("h1") %>% html_text() %>% as.character()
TotalGoals <- WS1 %>% html_nodes(".trow3+ .trow2 td~ td+ td font b") %>% html_text() %>% as.character()
temp <- data.frame(FMatch,TotalGoals)
Catcher1 <- rbind(Catcher1,temp)
cat("*")
}
This returns:
> Catcher1
FMatch TotalGoals
1 Dep. Espanol vs Comunicaciones 2.22
2 San Martín B. vs Ituzaingó 1.77
3 Leandro N. Alem vs Def. Unidos 2.03
4 Dep. Laferre vs Central Córdoba 2.44
5 J.J. Urquiza vs Sport. Italiano 2.53
6 Excursionistas vs Berazategui 2.56
7 Dock Sud vs Midland 1.74
8 Dep. Armenio vs Luján 1.47
#and so on
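As a side note, the element-by-element conversion loop in the workaround can probably be collapsed into one vectorized call. This is a sketch under the same assumptions as the code above (a one-column data frame called URLs whose column is a factor); the links are made-up stand-ins, and URLs3 is a name I introduced for the variant that skips the data frame entirely.

```r
# Hypothetical links standing in for the scraped URLs.
links <- c("http://www.soccerstats.com/pmatch.asp?x=1",
           "http://www.soccerstats.com/latest.asp",
           "http://www.soccerstats.com/pmatch.asp?x=2")

# Same shape as in the complete code: a one-column data frame of factors.
oversdf <- data.frame(URLs = links, stringsAsFactors = TRUE)
URLs <- subset(oversdf, grepl("pmatch", URLs))

# Option 1: convert the whole column at once instead of looping over rows.
URLs2 <- as.character(URLs$URLs)

# Option 2: filter the character vector directly and never build a data frame.
URLs3 <- links[grepl("pmatch", links)]
```

Either vector can then be used in for (i in URLs2) exactly as before.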