Question

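I am trying to get the values of "Date Posted" and "Date Updated" from a listing page. The URL is: http://sulit.com.ph/3991016

I have a feeling I should use xpathSApply, as suggested in the thread "Web Scraping (in R?)", but I can't get it to work.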

library(XML)

url = "http://sulit.com.ph/3991016"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# The XPath expression is the part I can't work out
date_posted = xpathSApply(doc, "??????????", xmlValue)

Does anyone also know a quick way to get the text "P27M" that is listed on the page? Any help would be appreciated.

Answer 1

Here is another approach.

> require(XML)
> 
> url = "http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE"
> doc = htmlParse(url)
> 
> dates = getNodeSet(doc, "//span[contains(string(.), 'Date Posted') or contains(string(.), 'Date Updated')]")
> dates = lapply(dates, function(x){
+         temp = xmlValue(xmlParent(x)["span"][[2]])
+         strptime(gsub("^[[:space:]]+|[[:space:]]+$", "", temp), format = "%B %d, %Y")
+ 
+ })
> dates
[[1]]
[1] "2012-07-05"

[[2]]
[1] "2011-08-11"

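There is no need to use RCurl, because htmlParse can fetch and parse the URL itself. getNodeSet returns a list of the nodes whose text contains "Date Posted" or "Date Updated". lapply loops over those two nodes, first finding the parent node and then taking the value of its second "span" child. This part may not be very robust if the site uses different layouts on different pages (which seems quite likely after looking at the site's HTML).

The gsub call from SlowLearner's answer trims the whitespace around the two dates. I added strptime to return the dates as date-time objects (POSIXlt) rather than strings, but that step is optional and depends on how you plan to use the information. HTH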
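The parent-then-second-span lookup can also be written as a single XPath step with the following-sibling axis. The sketch below is only an illustration: the HTML snippet, the dates in it, and the helper name get_date are made up for the example, since the real page's markup is not reproduced in this thread, and it assumes the label and the value sit in adjacent span elements.

```r
library(XML)

# Hypothetical markup; the real page's structure may differ.
html <- '<html><body>
  <div><span>Date Posted:</span><span> January 15, 2012 </span></div>
  <div><span>Date Updated:</span><span> March 2, 2012 </span></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# get_date is a hypothetical helper: find the span holding the label,
# then read the first span that follows it under the same parent.
get_date <- function(doc, label) {
  xp <- sprintf("//span[contains(string(.), '%s')]/following-sibling::span[1]", label)
  raw <- xpathSApply(doc, xp, xmlValue)
  # "%B" matches full month names in the current locale (English assumed here)
  as.Date(trimws(raw), format = "%B %d, %Y")
}

date_posted  <- get_date(doc, "Date Posted")
date_updated <- get_date(doc, "Date Updated")
```

Like the lapply version, this depends on the label and value being sibling spans, so it would need adjusting if the site uses a different layout.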

Answer 2

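This is not very elegant and probably not very robust, but it works for this case.

The first four statements after the require calls fetch the URL and extract the text. grepl returns TRUE or FALSE for each element depending on whether the string we are looking for was found, and which converts that into an index into the list. We add 1 to this index because, if you look at cleantext, you will see that the updated date is the next element in the list after the string "Date Updated". So the +1 picks the element that follows "Date Updated". The gsub lines just clean up the strings.

The problem with "P27M" is that it isn't anchored to anything; it is just free text floating at an arbitrary position. If you are sure the price is always "P" followed by 1 to 3 digits followed by "M", and that there is only one such string on the page, then grep or a regex will work; otherwise it will be hard to get.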

require(XML)
require(RCurl)

# Fetch the page and parse it into an internal document
myurl <- 'http://www.sulit.com.ph/index.php/view+classifieds/id/3991016/BEAUTIFUL+AYALA+HEIGHTS+QC+HOUSE+FOR+SALE'
mytext <- getURL(myurl)
myhtml <- htmlTreeParse(mytext, asText = TRUE, useInternalNodes = TRUE)
# Keep only visible text nodes (skip script, style and noscript content)
cleantext <- xpathApply(myhtml, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)

# Drop single-space entries and collapse double spaces
cleantext <- cleantext[!cleantext %in% " "]
cleantext <- gsub("  "," ", cleantext)

# The date is the element right after its label
date_updated <- cleantext[[which(grepl("Date Updated",cleantext))+1]]
date_posted <- cleantext[[which(grepl("Date Posted",cleantext))+1]]
# Trim leading and trailing whitespace
date_posted <- gsub("^[[:space:]]+|[[:space:]]+$","",date_posted)
date_updated <- gsub("^[[:space:]]+|[[:space:]]+$","",date_updated)

print(date_updated)
print(date_posted)
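
For the price, the regex idea mentioned above could look roughly like the following. This is a sketch under the stated assumptions only (a "P", 1 to 3 digits, then "M", appearing exactly once); the page_text vector is invented stand-in data, not the actual text of the listing.

```r
# Hypothetical stand-in for the cleaned page text
page_text <- c("BEAUTIFUL AYALA HEIGHTS QC HOUSE FOR SALE",
               "Price: P27M negotiable",
               "Date Posted")

# "P", then 1 to 3 digits, then "M", bounded on both sides by word edges
pattern <- "\\bP[0-9]{1,3}M\\b"

# Collect every match across all text elements
hits <- unlist(regmatches(page_text, gregexpr(pattern, page_text, perl = TRUE)))

# Only trust the result if the pattern matched exactly once
price <- if (length(hits) == 1) hits else NA_character_
print(price)
```

Returning NA when there are zero or several matches makes the "only one such string" assumption explicit instead of silently picking the first hit.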
