How to extract hyperlinks that satisfy a condition in R

Time: 2017-04-12 20:05:58

Tags: r web-scraping

I am extracting a page's source with R:
library(XML)  # provides htmlTreeParse(), xpathSApply(), xmlGetAttr()
# Download the page source to a local file
download.file("http://stats.espncricinfo.com/ci/engine/records/team/match_results_year.html?class=2;id=6;type=team",
              "dataDictionary.html")
docHtml <- htmlTreeParse("dataDictionary.html", useInternalNodes = TRUE)  # parse the saved HTML
links <- xpathSApply(docHtml, path = "//a", xmlGetAttr, "href")  # href of every <a> element

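Now I need to extract the links that look like "/ci/engine/records/team/match_results.html?class=2;id=*", where * can be anything. Every link that satisfies this condition should be stored in another variable. Any help?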

1 answer:

Answer 0: (score: 1)

You can use grep to detect all the links you are interested in:
GoodLinks = grep("/ci/engine/records/team/match_results\\.html\\?class=2;id", links)
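
Note that grep returns the positions of the matches, not the links themselves. Since the question asks for the matching links to be stored in another variable, here is a minimal sketch of one way to do that. It assumes the scraped links are a character vector; the `sample_links` values are made up for illustration, and `pattern` and `matched_links` are hypothetical names not taken from the original code.

```r
# Hypothetical sample standing in for the scraped `links` vector.
sample_links <- c(
  "/ci/engine/records/team/match_results.html?class=2;id=1974;type=year",
  "/ci/content/records/index.html",
  "/ci/engine/records/team/match_results.html?class=2;id=2017;type=year"
)

# Escape the regex metacharacters "." and "?" so they match literally.
pattern <- "/ci/engine/records/team/match_results\\.html\\?class=2;id="

# value = TRUE makes grep() return the matching strings instead of indices.
matched_links <- grep(pattern, sample_links, value = TRUE)
print(matched_links)
```

Indexing with the positions, as in links[GoodLinks], gives the same result; value = TRUE just skips the intermediate index vector.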

If you only want the id field, you can process these links with sub:
sub(".*id=(\\d+).*", "\\1", links[GoodLinks])
[1] "1974" "1975" "1976" "1978" "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989" "1990"
[17] "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006"
[33] "2007" "2008" "2009" "2010" "2011" "2012" "2013" "2014" "2015" "2016" "2017"
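
One caveat: sub returns its input unchanged when the pattern does not match, so it is only safe on links that were already filtered. As a hedged alternative, the sketch below pulls the id out with regexec and regmatches and yields NA for links without an id. The `sample_links` values are made up for illustration, and `matches` and `ids` are hypothetical names.

```r
# Hypothetical sample; the last entry deliberately has no id field.
sample_links <- c(
  "/ci/engine/records/team/match_results.html?class=2;id=1974;type=year",
  "/ci/engine/records/team/match_results.html?class=2;id=2017;type=year",
  "/ci/content/records/index.html"
)

# regexec() records the full match and each capture group per element;
# regmatches() turns that into a list of character vectors.
matches <- regmatches(sample_links, regexec("id=([0-9]+)", sample_links))

# Keep the first capture group as an integer, or NA when nothing matched.
ids <- vapply(
  matches,
  function(m) if (length(m) >= 2) as.integer(m[2]) else NA_integer_,
  integer(1)
)
print(ids)  # values: 1974, 2017, NA
```

Converting to integer is optional; it is convenient if the ids are later sorted or compared as numbers.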