How can I optimize my R code for web-scraping German Bundesliga football data?

Asked: 2013-07-20 10:54:16

Tags: xml r optimization web-scraping

I have some experience writing short pieces of R code, but I'm honestly not very good at it. I was wondering whether you could help me optimize my R code for querying football results and chances from the German magazine Kicker.

You can get all the results and scores (for the 2012/13 season) from this link:

http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2012-13/-1/0/spieltag.html

In addition, I need the chances for each match. You just have to follow the 'Analyse' link, which you can find at the bottom right of the 'Spielinfo' page.

However, in my code I first have to go via a 'Spielinfo' URL of this form:

http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2012-13/1/1441337/spielinfo_borussia-dortmund-17_werder-bremen-4.html

and afterwards switch to the 'Analyse' part (by replacing 'spielinfo' with 'spielanalyse' in the URL).
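In R this is just a text substitution on the match URL (the same sub() call my function uses further down); a minimal example with the match linked above:

info_url <- "http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/2012-13/1/1441337/spielinfo_borussia-dortmund-17_werder-bremen-4.html"
# rewrite the 'Spielinfo' address into its 'Analyse' counterpart
analyse_url <- sub("spielinfo", "spielanalyse", info_url)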

My code works, but there seems to be a bottleneck, and so far I haven't found a way to improve the performance. Maybe you can help me understand my script better.
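To narrow down where the time goes, I suppose one could time the download/parsing of a single sub-page separately from the XPath query on it; just a rough sketch, reusing the analyse_url built above:

library(XML)

# downloading and parsing one 'spielanalyse' page (includes the network round trip)
system.time(doc1 <- htmlParse(analyse_url, encoding = "UTF-8"))

# querying the already parsed document (usually much cheaper than the download)
system.time(xpathSApply(doc1, "//div[@id='ctl00_PlaceHolderHalf_ctl03_chancen']//div[@class='wert']", xmlValue))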

I wrote a function like this:

library(XML)
library(stringr)
library(plyr)


kicker.daten <- function(saison)
{
  # define the url ('base url')
  url <- paste("http://www.kicker.de/news/fussball/bundesliga/spieltag/1-bundesliga/", saison, "/-1/0/spieltag.html", sep = "")

  # save source code for parsing
  doc <- htmlParse(url, isURL = TRUE, encoding = 'UTF-8')

  # Query document to extract the teams (home, away)
  # btw, really difficult to find the path
  spiele <- xpathApply(doc, "//div[@id='ctl00_PlaceHolderContent_begegnungenCtrl']//table[@class='tStat']//div/a[@class='link']", xmlValue)

  # convert to vector, don't know why but it works
  spiele <- unlist(spiele)
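  # (xpathApply returns a list with one entry per matched node,
  #  so unlist() just flattens that list into a character vector)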

  # separating home and away teams
  heim <- character()
  gast <- character()

  for (i in 1 : (length(spiele)/2))
  {
    heim[i] <- spiele[2*i-1]
    gast[i] <- spiele[2*i]
  }
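  # a vectorized sketch equivalent to the loop above (odd positions are home, even are away):
  # heim <- spiele[seq(1, length(spiele), by = 2)]
  # gast <- spiele[seq(2, length(spiele), by = 2)]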

  # query document to extract results
  ergebnis <- xpathApply(doc, "//div[@id='ctl00_PlaceHolderContent_begegnungenCtrl']//table[@class='tStat']//td[@class='alignright']", xmlValue)

  # again convert to vector
  ergebnis <- unlist(ergebnis)

  # text processing to get the results
  ergebnis <- substring(ergebnis, 1, str_locate(ergebnis, '\\s')[,1]-1)

  # extracting scores for home and away team
  th <- as.numeric(substring(ergebnis, 1, str_locate(ergebnis, ':')[,1]-1))
  tg <- as.numeric(substring(ergebnis, str_locate(ergebnis, ':')[,1] + 1))
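  # an equivalent sketch without str_locate(): splitting each result on ':' gives both scores at once
  # tore <- do.call(rbind, strsplit(ergebnis, ":", fixed = TRUE))
  # th <- as.numeric(tore[, 1]); tg <- as.numeric(tore[, 2])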

  # extract the links from 'base url' to get all the pages / url that contain the chances
  # unfortunately, I first had to go via 'Spielinfo'
  links <- xpathApply(doc, "//td[@class='aligncenter']/a", xmlGetAttr, "href")

  # convert to vector
  # substitute spielinfo with spielanalyse to extract the chances
  links <- unlist(links)
  links <- sub('spielinfo', 'spielanalyse', links)

  # define urls
  sub_url <- character()
  for(i in 1:length(links))
  {
    sub_url[i] <- paste("http://www.kicker.de", links[i], sep = "")
  }
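  # paste() is vectorized, so the loop above could also be written as a single call:
  # sub_url <- paste("http://www.kicker.de", links, sep = "")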

  # save source code in sub-documents
  sub_doc <- llply(sub_url, htmlParse, .progress = "text", .inform = T)

  # query every subdocument
  # I guess the bottleneck is here!!!
  chancen <- list()
  for (i in 1:length(sub_doc)) 
  {
    chancen[[i]] <- xpathSApply(sub_doc[[i]], "//div[@id='ctl00_PlaceHolderHalf_ctl03_chancen']//div[@class='wert']", xmlValue)
  }
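  # (the xpathSApply() calls here only query documents that are already parsed;
  #  most of the waiting time probably happens in the htmlParse() downloads above)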

  # convert to vector
  chancen <- unlist(chancen)

  # separate chances for home and away team
  ch <- as.numeric(substring(chancen, 1, str_locate(chancen, ':')[,1]-1))
  cg <- as.numeric(substring(chancen, str_locate(chancen, ':')[,1] + 1))

  # first dataframe to exclude missing values
  df <- na.omit(data.frame(cbind(heim, gast, ergebnis, th, tg), stringsAsFactors = F))
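  # note: cbind() coerces th and tg to character here, which is why they are
  # converted back with as.integer() when the final data frame is built below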

  # season column
  saison <- rep(saison, 306)

  # matchday (Spieltag)
  spieltag <- as.numeric(sort(rep(1:34, times = 9)))

  # final dataframe
  result <- data.frame(cbind(saison = saison, spieltag = spieltag, heim = df$heim, gast = df$gast, ergebnis = df$ergebnis, chancen = chancen, th = as.integer(df$th), tg = as.integer(df$tg), ch = as.integer(ch), cg = as.integer(cg)), stringsAsFactors = F)
  return(result)
}

# All results for the season 2012/13

kicker.daten("2012-13")

Please let me know if any of my comments are misleading. To be honest, I don't really know in detail what the code is doing.

Thanks,

Rami

0 Answers:

There are no answers yet.