我在编写简短的R代码方面有一些经验,但实际上我并不擅长它。我想知道你是否可以帮我优化我的R代码,以便从德国杂志Kicker查询足球比分和机会。
您可以从此链接(2012/13赛季)获得所有结果和分数:
http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2012-13/-1/0/spieltag.html
另外,我需要每场比赛的机会。您只需按照链接"分析"你会发现" Spielinfo"页面右下角的部分。
然而,在我的代码中,我首先必须采用这种方式。
之后进入分析'部分(用spielanfos代替spielinfo)
我的代码有效,但似乎存在瓶颈,我还没有找到任何解决方案来提高性能。也许你可以帮助我更好地理解我的剧本。
我写了一个这样的函数:
library(XML)
library(stringr)
library(plyr)
kicker.daten <- function(saison)
{
# define the url ('base url')
url <- paste("http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/", saison, "/-1/0/spieltag.html", sep = "")
# save source code for parsing
doc <- htmlParse(url, isURL = TRUE, encoding = 'UTF-8')
# Query document to extract the teams (home, away)
# btw, really difficult to find the path
spiele <- xpathApply(doc, "//div[@id='ctl00_PlaceHolderContent_begegnungenCtrl']//table[@class='tStat']//div/a[@class='link']", xmlValue, "class")
# convert to vector, don't know why but it works
spiele <- unlist(spiele)
# separating home and away teams
heim <- character()
gast <- character()
for (i in 1 : (length(spiele)/2))
{
heim[i] <- spiele[2*i-1]
gast[i] <- spiele[2*i]
}
# query document to extract results
ergebnis <- xpathApply(doc, "//div[@id='ctl00_PlaceHolderContent_begegnungenCtrl']//table[@class='tStat']//td[@class='alignright']", xmlValue, "class")
# again convert to vector
ergebnis <- unlist(ergebnis)
# text processing to get the results
ergebnis <- substring(ergebnis, 1, str_locate(ergebnis, '\\s')[,1]-1)
# extracting scores for home and away team
th <- as.numeric(substring(ergebnis, 1, str_locate(ergebnis, ':')[,1]-1))
tg <- as.numeric(substring(ergebnis, str_locate(ergebnis, ':')[,1]+1), str_locate(ergebnis, '-:-')[,1]+1)
# extract the links from 'base url' to get all the pages / url that contain the chances
# unfortunately, I first had to go via 'Spielinfo'
links <- xpathApply(doc, "//td[@class='aligncenter']/a", xmlGetAttr, "href")
# convert to vector
# substitute spielinfo with spielanalyse to extraxt the chances
links <- unlist(links)
links <- sub('spielinfo', 'spielanalyse', links)
# define urls
sub_url <- character()
for(i in 1:length(links))
{
sub_url[i] <- paste("http://www.kicker.de", links[i], sep = "")
}
# save source code in sub-documents
sub_doc <- llply(sub_url, htmlParse, .progress = "text", .inform = T)
# query every subdocument
# I guess the bottleneck is here!!!
chancen <- list()
for (i in 1:length(sub_doc))
{
chancen[[i]] <- xpathSApply(sub_doc[[i]], "//div[@id='ctl00_PlaceHolderHalf_ctl03_chancen']//div[@class='wert']", xmlValue, "class")
}
# convert to vector
chancen <- unlist(chancen)
# separate chances for home and away team
ch <- as.numeric(substring(chancen, 1, str_locate(chancen, ':')[,1]-1))
cg <- as.numeric(substring(chancen, str_locate(chancen, ':')[,1]+1), str_locate(chancen, ':')[,1]+1)
# first dataframe to exclude missing values
df <- na.omit(data.frame(cbind(heim, gast, ergebnis, th, tg), stringsAsFactors = F))
# season column
saison <- rep(saison, 306)
# daY of match
spieltag <- as.numeric(sort(rep(1:34, times = 9)))
# final dataframe
result <- data.frame(cbind(saison = saison, spieltag = spieltag, heim = df$heim, gast = df$gast, ergebnis = df$ergebnis, chancen = chancen, th = as.integer(df$th), tg = as.integer(df$tg), ch = as.integer(ch)), cg = as.integer(cg), stringsAsFactors = F)
return(result)
}
# All results for the season 2012/13
kicker.daten("2012-13")
如果我的评论有误导性,请告诉我。说实话,我真的不知道代码到底是做什么的。
谢谢,
拉米