将现场奥运奖牌数据下载到R

时间:2012-07-29 19:41:18

标签: r

看起来网站阻止了Curl的直接访问。

library(XML) 
library(RCurl) 
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURL(theurl)

page # fail
[1] "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don't have permission to access \"http&#58;&#47;&#47;www&#46;london2012&#46;com&#47;medals&#47;medal&#45;count&#47;\" on this server.<P>\nReference&#32;&#35;18&#46;358a503f&#46;1343590091&#46;c056ae2\n</BODY>\n</HTML>\n"

让我们试着看看我们是否可以直接从表中访问它。

page <- readHTMLTable(theurl)

那里没有运气Error in htmlParse(doc) : error in creating parser for http://www.london2012.com/medals/medal-count/

你如何将这张桌子放入R?


更新:响应评论和玩弄,伪造用户代理字符串以获取内容。但readHTMLtable返回错误。

page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")

2 个答案:

答案 0 :(得分:12)

看起来这样有效:

rr <- readHTMLTable(page,header=FALSE)
rr2 <- setNames(rr[[1]],
                c("rank","country","gold","silver","bronze","junk","total"))
rr3 <- subset(rr2,select=-junk)
## oops, numbers all got turned into factors ...
tmpf <- function(x) { as.numeric(as.character(x)) }
rr3[,-2] <- sapply(rr3[,-2],tmpf)               
head(rr3)
##   rank                                country gold silver bronze total
## 1    1             People's Republic of China    6      4      2    12
## 2    2               United States of America    3      5      3    11
## 3    3                                  Italy    2      3      2     7
## 4    4                      Republic of Korea    2      1      2     5
## 5    5                                 France    2      1      1     4
## 6    6 Democratic People's Republic  of Korea    2      0      1     3
with(rr3,dotchart(total,country))

答案 1 :(得分:12)

这是我使用正则表达式提出的。非常具体,绝对不比使用其他答案中使用的readHTMLTable更好。更多内容表明你可以在R:

中使用textmining
# file <- "~/Documents/R/medals.html"
# page <- readChar(file,file.info(file)$size)

library(RCurl) 
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")


# Remove html tags:
page <- gsub("<(.|\n)*?>","",page)
# Remove newlines and tabs:
page <- gsub("\\n","",page)

# match table:
page <- regmatches(page,regexpr("(?<=Total).*(?=Detailed)",page,perl=TRUE))

# Extract country+medals+rank
codes <-regmatches(page,gregexpr("\\d+[^\\r]*\\d+",page,perl=TRUE))[[1]]
codes <- codes[seq(1,length(codes)-2,by=2)]

# Extract country and medals:
Names <- gsub("\\d","",codes)
Medals <- sapply(regmatches(codes,gregexpr("\\d",codes)),function(x)x[(length(x)-2):length(x)])

# Create data frame:
data.frame(
  Country = Names,
  Gold = as.numeric(Medals[1,]),
  Silver = as.numeric(Medals[2,]),
  Bronze = as.numeric(Medals[3,]))

输出:

                                  Country Gold Silver Bronze
1              People's Republic of China    6      4      2
2                United States of America    3      5      3
3                                   Italy    2      3      2
4                       Republic of Korea    2      1      2
5                                  France    2      1      1
6  Democratic People's Republic  of Korea    2      0      1
7                              Kazakhstan    2      0      0
8                               Australia    1      1      1
9                                  Brazil    1      1      1
10                                Hungary    1      1      1
11                            Netherlands    1      1      0
12                     Russian Federation    1      0      3
13                                Georgia    1      0      0
14                           South Africa    1      0      0
15                                  Japan    0      2      3
16                          Great Britain    0      1      1
17                               Colombia    0      1      0
18                                   Cuba    0      1      0
19                                 Poland    0      1      0
20                                Romania    0      1      0
21                Taipei (Chinese Taipei)    0      1      0
22                             Azerbaijan    0      0      1
23                                Belgium    0      0      1
24                                 Canada    0      0      1
25                    Republic of Moldova    0      0      1
26                                 Norway    0      0      1
27                                 Serbia    0      0      1
28                               Slovakia    0      0      1
29                                Ukraine    0      0      1
30                             Uzbekistan    0      0      1