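I have a function that I call from a loop. The basic idea is to load a list of URLs and build a data frame in which each row is a URL and each column is an attribute I'm interested in scraping. When I first ran it, I didn't include the index (the square brackets at the end), and it worked fine until I hit a URL whose page had more than one matching element. So I changed it to the code below, and I get no errors, but no matter how many URLs I supply, my data frame has only one row.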
require(RCurl)
require(XML)
scrp.getDtls <- function(url){
src = getURL(url,encoding="UTF-8")
prsd = htmlParse(src)
title = xpathSApply(prsd, "//meta[@name='title']/@content")[1] #added to return first element only
brand = xpathSApply(prsd, "//meta[@itemprop='brand']/@content")[1]
model = xpathSApply(prsd, "//meta[@itemprop='model']/@content")[1]
upc = xpathSApply(prsd, "//meta[@itemprop='productID']/@content")[1]
price = xpathSApply(prsd, "//div/meta[@itemprop='price']/@content")[1]
x = data.frame(title,brand,model,upc,price)
}
urls = read.csv("urls.csv", header=FALSE)
x = NA
for(url in urls){
x = rbind(x,scrp.getDtls(url))
}
x = x[-1,]
View(x)
#CSV file partial contents
"http://www.walmart.com/ip/Suave-Naturals-Ocean-Breeze-Shampoo-22.5-oz/10293577"
"http://www.walmart.com/ip/Gillette-Fusion-Cartridges-4-count/14071267"
"http://www.walmart.com/ip/Sensodyne-Pronamel-Mint-Essence-Toothpaste-4-oz/10316819"
"http://www.walmart.com/ip/Speed-Stick-Ocean-Surf-Deodorant-3-oz/11965072"
Thanks :)
Answer 0 (score: 0)
Does this do what you want?
require(RCurl)
require(XML)
Function definition
scrp_getdtls <- function(url){
src = getURL(url,encoding="UTF-8")
prsd = htmlParse(src)
title = xpathSApply(prsd, "//meta[@name='title']/@content")[1] #added to return first element only
brand = xpathSApply(prsd, "//meta[@itemprop='brand']/@content")[1]
model = xpathSApply(prsd, "//meta[@itemprop='model']/@content")[1]
upc = xpathSApply(prsd, "//meta[@itemprop='productID']/@content")[1]
price = xpathSApply(prsd, "//div/meta[@itemprop='price']/@content")[1]
data.frame(title,brand,model,upc,price)
}
URLs
urls <- c("http://www.walmart.com/ip/Suave-Naturals-Ocean-Breeze-Shampoo-22.5-oz/10293577",
"http://www.walmart.com/ip/Gillette-Fusion-Cartridges-4-count/14071267",
"http://www.walmart.com/ip/Sensodyne-Pronamel-Mint-Essence-Toothpaste-4-oz/10316819",
"http://www.walmart.com/ip/Speed-Stick-Ocean-Surf-Deodorant-3-oz/11965072")
Use lapply to pass each URL to the function, then combine the resulting rows with do.call and rbind.
out <- lapply(urls, scrp_getdtls)
do.call(rbind, out)
## title brand
## content Suave Naturals Ocean Breeze Shampoo, 22.5 oz Suave
## content1 Gillette Fusion Cartridges, 4 count Gillette
## content2 Sensodyne Pronamel Mint Essence Toothpaste, 4 oz Sensodyne
## content3 Speed Stick Ocean Surf Deodorant, 3 oz Speed Stick
## model upc price
## content 89280 079400832801 1.5
## content1 SFS ONLY 047400156579 15.97
## content2 83050 310158830504 4.92
## content3 93008 022200930086 1.98
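As for why the original loop produced only one row: a likely cause is that read.csv returns a data frame, and a for loop over a data frame visits its columns, not its rows. With a single-column file the loop body therefore runs once, receiving the whole column of URLs. The sketch below shows this offline; the example.com addresses and the counter names n_df and n_col are made up for illustration, and the data frame stands in for the result of read.csv("urls.csv", header=FALSE).

```r
# Stand-in for read.csv("urls.csv", header = FALSE): one column (V1), four rows.
# The URLs here are placeholders, not the ones from the question.
urls <- data.frame(V1 = c("http://example.com/a", "http://example.com/b",
                          "http://example.com/c", "http://example.com/d"),
                   stringsAsFactors = FALSE)

# Looping over the data frame itself visits each COLUMN once.
n_df <- 0
for (url in urls) {
  n_df <- n_df + 1       # here url is the entire V1 column (length 4)
}

# Looping over the column visits each URL once.
n_col <- 0
for (url in as.character(urls$V1)) {
  n_col <- n_col + 1     # here url is a single string
}

c(data_frame = n_df, column = n_col)
## data_frame     column
##          1          4
```

If that is what is happening, changing the original loop to iterate over urls$V1 (or passing as.character(urls$V1) to lapply as above) should give one row per URL. The as.character call is only a precaution in case read.csv returned the column as a factor.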