Question

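The official Premier League website provides various statistics for the league's teams (e.g. this one). I use the readHTMLTable function from the XML R package to retrieve these tables. However, I noticed that the function cannot read the tables for May, while it works fine for the other months. Here is an example: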

library(XML) ## provides readHTMLTable

april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df), ] ## correct table


april2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
april2014.df <- readHTMLTable(april2014.url, which = 1)
april2014.df[complete.cases(april2014.df), ] ## correct table

may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team

may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which = 1)
may.df2 ## Just data for the first team

As you can see, the function fails to retrieve the data for May.

Could someone please explain why this happens and how it can be fixed?

Edit after @zyurnaidi's answer:

Below is code that does the job without any manual editing.

url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data as of 09-05-2010

## read the raw page source line by line
con <- file(url)
raw <- readLines(con)
close(con)

pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## this part of the page source seems to be what messes things up

## replace the problematic fragment before parsing
raw <- gsub(pattern = pattern, replacement = '""', x = raw)

df <- readHTMLTable(doc = raw, which = 1)
df[complete.cases(df), ] ## correct table

Answer 1

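OK. I found a few hints about the problem:
1. The problem occurs consistently in May, which is the final month of every season. That suggests there is something unique about this particular case.
2. Direct parsing (htmlParse, both from the link and from a downloaded file) produces a truncated document. The table and the HTML document are closed abruptly right after the first team in the table.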

The parsed data always diverges from the original after this point:

<span class=" cupchampions-league=

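After downloading the HTML file and inspecting it carefully, I found what looks like an (unencoded?) character problem there. My guess is that it is caused by the cute little trophy icon shown after the team's name.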
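
One way to test this guess is to scan the downloaded lines for byte sequences that are not valid UTF-8. The following is only a minimal sketch in base R: the `raw` vector is made-up sample data standing in for the output of readLines, and I am assuming the page is supposed to be UTF-8. Whether the real page actually contains such bytes is not confirmed here.

```r
# Hypothetical sample standing in for readLines() output from the May page.
raw <- c(
  "<td>Manchester United</td>",
  "<td>Chelsea \xff\xfe</td>",   # these two bytes are not valid UTF-8
  "<td>Liverpool</td>"
)

# Indices of the lines that contain invalid UTF-8 byte sequences.
bad <- which(!validUTF8(raw))
print(bad)

# Drop the offending bytes: sub = "" removes any byte iconv cannot convert.
clean <- iconv(raw, from = "UTF-8", to = "UTF-8", sub = "")
print(clean[bad])
```

If `bad` comes back empty for the real page, the truncation is more likely caused by malformed markup than by the character encoding.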

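Anyway, to work around the problem you need to get rid of these bad characters. Instead of editing the downloaded HTML file, I suggest:
1. View the page source of the EPL URL for the May league table.
2. Copy everything, paste it into a text editor, and save it as an HTML file.
3. You can now use htmlParse or readHTMLTable on that file.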

There may be a better way to automate this, but I hope it helps.
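
As a rough idea of what such automation could look like, the clean-up and the parsing can be wrapped in one small function. This is only a sketch: `read_clean_table` and the `@@BAD@@` marker are hypothetical names invented for the illustration, the input is a made-up table rather than the real page, and the fragment is simply deleted, which may not be the right replacement for the real markup.

```r
library(XML)  # provides htmlParse() and readHTMLTable()

# Hypothetical helper: strip a problematic literal fragment, then parse.
read_clean_table <- function(lines, bad_fragment, which = 1) {
  # fixed = TRUE treats bad_fragment as a literal string, not a regex.
  cleaned <- gsub(bad_fragment, "", lines, fixed = TRUE)
  # Collapse the lines into one string and parse it as HTML text.
  doc <- htmlParse(paste(cleaned, collapse = "\n"), asText = TRUE)
  readHTMLTable(doc, which = which, stringsAsFactors = FALSE)
}

# Made-up input standing in for readLines() on the league-table page.
lines <- c(
  "<html><body><table>",
  "<tr><th>Pos</th><th>Club</th></tr>",
  "<tr><td>1</td><td>Manchester United@@BAD@@</td></tr>",
  "<tr><td>2</td><td>Chelsea</td></tr>",
  "</table></body></html>"
)
df <- read_clean_table(lines, bad_fragment = "@@BAD@@")
print(df)
```

For the real page you would pass the lines returned by readLines and the fragment that breaks the parser.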

Why can't readHTMLTable successfully read the Premier League tables for May?
