R for web scraping: pulling prices and names

Asked: 2014-08-27 20:00:01

Tags: xml r xpath xml-parsing html-parsing

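I am trying to pull the list of prices and game names from the Steam store at the URL below, but I can't figure out how xpathSApply should be used to parse the page: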

        http://store.steampowered.com/search/?sort_by=Price&sort_order=ASC&

Here is my code:

require(RCurl)
require(XML)
url <- "http://store.steampowered.com/search/results?sort_by=Name&sort_order=ASC&category1=1"
SOURCE <-  getURL(url,encoding="UTF-8") #Download the page
substring (SOURCE,1,200)
PARSED <- htmlParse(SOURCE) #Format the html code 
##My problem is in this line below 
(xpathSApply(PARSED, "//div[@class='col search_price']"))

1 answer:

Answer 0 (score: 3)

Try this:

require(RCurl)
require(XML)
url <- "http://store.steampowered.com/search/?sort_by=Metascore&sort_order=DESC&"
SOURCE <-  getURL(url, encoding="UTF-8") #Download the page
PARSED <- htmlParse(SOURCE, asText = TRUE, encoding = "utf-8")
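# Named XPath expressions: one for the price cells, one for the game titles
xpaths <- c(price="//a/div[@class='col search_price']", 
            title="//div[@class='col search_name ellipsis']/h4")
# Apply xmlValue to every matched node to extract its trimmed text
res <- sapply(xpaths, function(x) xpathSApply(PARSED, x, xmlValue, trim = TRUE) )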
head(res)
#      price    title                        
# [1,] "9,99€"  "Half-Life 2"                
# [2,] "9,99€"  "Half-Life"                  
# [3,] "19,99€" "BioShock™"                  
# [4,] "18,99€" "The Orange Box"             
# [5,] "19,99€" "Portal 2"                   
# [6,] "14,99€" "The Elder Scrolls V: Skyrim"
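
The key difference from the code in the question appears to be the extra `xmlValue` argument: without a function, `xpathSApply` hands back the matched nodes rather than their text. Below is a minimal offline sketch of that difference. The inline HTML is a hypothetical stand-in that only mimics the class names used above, so the real Steam markup may well differ, and the variable names (`html`, `doc`, `nodes`, `prices`, `titles`, `games`) are made up for illustration.

```r
library(XML)

# Hypothetical stand-in for the search results markup (assumption, not the real page).
html <- '
<html><body>
  <div class="col search_name ellipsis"><h4>Half-Life 2</h4></div>
  <div class="col search_price"> 9,99€ </div>
  <div class="col search_name ellipsis"><h4>Portal 2</h4></div>
  <div class="col search_price"> 19,99€ </div>
</body></html>'

doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# No function supplied: the result holds node objects, not strings.
nodes <- xpathSApply(doc, "//div[@class='col search_price']")

# With xmlValue: each matched node is reduced to its (trimmed) text content.
prices <- xpathSApply(doc, "//div[@class='col search_price']",
                      xmlValue, trim = TRUE)
titles <- xpathSApply(doc, "//div[@class='col search_name ellipsis']/h4",
                      xmlValue, trim = TRUE)

# Combine both vectors; this assumes every title has exactly one price.
games <- data.frame(title = titles, price = prices, stringsAsFactors = FALSE)
print(games)
```

A data frame is used here instead of the matrix returned by `sapply` only because it keeps the columns named and typed separately; if the two XPath queries ever return different numbers of matches, either approach would need extra handling.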