R web scrape function - indexing vector returns only 1 result

Date: 2014-03-19 17:10:07

Tags: r web-scraping

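I have a function that I call from a loop. The basic idea is to load a list of URLs and build a data frame in which each row is a URL and each column is an attribute I am interested in scraping. When I first ran it, I did not include the indexing (the square brackets at the end of each line), and it worked fine until I hit a URL whose page had more than one matching element. So I changed it to the code below. It runs without errors, but no matter how many URLs I supply, my data frame ends up with only one row.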

    require(RCurl)
    require(XML)

    scrp.getDtls <- function(url){
      src = getURL(url,encoding="UTF-8")
      prsd = htmlParse(src)
      title = xpathSApply(prsd, "//meta[@name='title']/@content")[1] #added to return first element only
      brand = xpathSApply(prsd, "//meta[@itemprop='brand']/@content")[1]
      model = xpathSApply(prsd, "//meta[@itemprop='model']/@content")[1]
      upc = xpathSApply(prsd, "//meta[@itemprop='productID']/@content")[1]
      price = xpathSApply(prsd, "//div/meta[@itemprop='price']/@content")[1]
      x = data.frame(title,brand,model,upc,price)
    }

    urls = read.csv("urls.csv", header=FALSE)

    x = NA
    for(url in urls){
      x = rbind(x,scrp.getDtls(url))
    }

    x = x[-1,]
    View(x)

    #CSV file partial contents
    "http://www.walmart.com/ip/Suave-Naturals-Ocean-Breeze-Shampoo-22.5-oz/10293577"
    "http://www.walmart.com/ip/Gillette-Fusion-Cartridges-4-count/14071267"
    "http://www.walmart.com/ip/Sensodyne-Pronamel-Mint-Essence-Toothpaste-4-oz/10316819"
    "http://www.walmart.com/ip/Speed-Stick-Ocean-Surf-Deodorant-3-oz/11965072"

Thanks :)

1 answer:

Answer 0 (score: 0)

Does this do what you want?

    require(RCurl)
    require(XML)

Function definition

    scrp_getdtls <- function(url){
      src = getURL(url,encoding="UTF-8")
      prsd = htmlParse(src)
      title = xpathSApply(prsd, "//meta[@name='title']/@content")[1] #added to return first element only
      brand = xpathSApply(prsd, "//meta[@itemprop='brand']/@content")[1]
      model = xpathSApply(prsd, "//meta[@itemprop='model']/@content")[1]
      upc = xpathSApply(prsd, "//meta[@itemprop='productID']/@content")[1]
      price = xpathSApply(prsd, "//div/meta[@itemprop='price']/@content")[1]
      data.frame(title,brand,model,upc,price)
    }
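
One fragility in the function above: when a page has no matching element, the XPath query comes back empty, and indexing an empty result with `[1]` may not give a usable scalar for `data.frame()`. A minimal sketch of a guard, assuming a hypothetical helper named `first_or_na` (not part of RCurl or XML) that takes whatever the query returned. The inputs below are stand-ins, so no network access is needed:

```r
# Hypothetical helper: return the first match as a character string,
# or NA when the query matched nothing.
first_or_na <- function(x) {
  if (length(x) == 0) {
    return(NA_character_)
  }
  as.character(x[[1]])
}

# Stand-ins for what an XPath query might return
no_match    <- list()
one_match   <- c(content = "Suave")
two_matches <- c(content = "Suave", content = "Suave Naturals")

first_or_na(no_match)     # NA
first_or_na(one_match)    # "Suave"
first_or_na(two_matches)  # "Suave"
```

Inside `scrp_getdtls`, each `xpathSApply(...)[1]` could then be written as `first_or_na(xpathSApply(...))`.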

URLs

    urls <- c("http://www.walmart.com/ip/Suave-Naturals-Ocean-Breeze-Shampoo-22.5-oz/10293577",
              "http://www.walmart.com/ip/Gillette-Fusion-Cartridges-4-count/14071267",
              "http://www.walmart.com/ip/Sensodyne-Pronamel-Mint-Essence-Toothpaste-4-oz/10316819",
              "http://www.walmart.com/ip/Speed-Stick-Ocean-Surf-Deodorant-3-oz/11965072")
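
In the question the URLs came from `urls.csv`. `read.csv` returns a data frame, and a `for` loop over a data frame iterates over its columns, not its rows, so the loop body ran once with the whole column of URLs. That is a likely reason only one row came back. A minimal sketch of pulling the column out as a character vector, using an in-memory stand-in for the file (the `text =` argument replaces the file name here purely for illustration):

```r
# In-memory stand-in for urls.csv: one quoted URL per line, no header
csv_text <- paste(
  '"http://www.walmart.com/ip/Suave-Naturals-Ocean-Breeze-Shampoo-22.5-oz/10293577"',
  '"http://www.walmart.com/ip/Gillette-Fusion-Cartridges-4-count/14071267"',
  sep = "\n"
)

url_df <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Looping over the data frame visits each column once, not each row
n_iter <- 0
for (col in url_df) {
  n_iter <- n_iter + 1
}
n_iter        # 1: a single iteration holding both URLs

# Extract the first column to get one element per URL
urls <- url_df[[1]]
length(urls)  # 2
```

With a real file, `read.csv("urls.csv", header=FALSE, stringsAsFactors=FALSE)[[1]]` would give the same kind of vector as the `c(...)` call above.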

Use lapply to pass each URL to the function, then combine the resulting rows with do.call and rbind.

    out <- lapply(urls, scrp_getdtls)
    do.call(rbind, out)

    ##                                                     title       brand
    ## content      Suave Naturals Ocean Breeze Shampoo, 22.5 oz       Suave
    ## content1              Gillette Fusion Cartridges, 4 count    Gillette
    ## content2 Sensodyne Pronamel Mint Essence Toothpaste, 4 oz   Sensodyne
    ## content3           Speed Stick Ocean Surf Deodorant, 3 oz Speed Stick
    ##             model          upc price
    ## content     89280 079400832801   1.5
    ## content1 SFS ONLY 047400156579 15.97
    ## content2    83050 310158830504  4.92
    ## content3    93008 022200930086  1.98
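
The `lapply` plus `do.call(rbind, ...)` pattern can be tried without any network access. A minimal sketch, assuming a hypothetical stub `fake_getdtls` in place of the real scraper and plain ID strings as input. It also resets row names such as `content`, `content1` seen in the output above:

```r
# Hypothetical stub standing in for scrp_getdtls: returns a one-row data frame
fake_getdtls <- function(id) {
  data.frame(
    id    = id,
    title = paste("Product", id),
    stringsAsFactors = FALSE
  )
}

ids <- c("10293577", "14071267", "10316819")

# One data frame per input, collected in a list
rows <- lapply(ids, fake_getdtls)

# Stack the list of one-row data frames into a single data frame
out <- do.call(rbind, rows)

# Replace any inherited row names with the default 1, 2, 3, ...
rownames(out) <- NULL

nrow(out)   # 3
out$title   # "Product 10293577" "Product 14071267" "Product 10316819"
```

Because each call returns its own one-row data frame, the number of rows in the result follows the number of inputs, which is the behavior the question was after.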