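I have a bunch of web pages and I want to extract their publication dates. For some pages, the date is in an "abbr" tag (for example: abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13), and I was able to get the date with:
doc = htmlParse(theURL, asText = T)
xpathSApply(doc, "//abbr", xmlValue)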
But for other pages, the date is in "meta" tags, for example:
meta name="created" content="2011-12-29T11:49:23+00:00"
meta name="OriginalPublicationDate" content="2012/11/14 10:56:58"
I tried xpathSApply(doc, "//meta", xmlValue), but it did not work.
So what pattern should I use instead of "//meta"?
Thanks!
Answer 0 (score: 2)
Using this page as an example:
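library(XML)
url <- "http://stackoverflow.com/questions/22342501/"
# Parse the page into an internal document that supports XPath queries
doc <- htmlParse(url, useInternalNodes=TRUE)
# Select the attribute nodes (@name, @content) rather than the element text
names <- doc["//meta/@name"]
content <- doc["//meta/@content"]
cbind(names,content)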
# names content
# [1,] "twitter:card" "summary"
# [2,] "twitter:domain" "stackoverflow.com"
# [3,] "og:type" "website"
# [4,] "og:image" "http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6"
# [5,] "og:title" "how to get information within <meta name...> tag in html using htmlParse and xpathSApply"
# [6,] "og:description" "I have a bunch of webpages and I want to extract their publishing dates. \nFor some webpages, the da" [truncated]
# [7,] "og:url" "http://stackoverflow.com/questions/22342501/how-to-get-information-within-meta-name-tag-in-html-usi" [truncated]
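The problem with
xpathSApply(doc, "//meta", xmlValue)
is that xmlValue(...) returns the element's content (that is, the text part of the element). <meta> tags have no text; their data is stored in attributes, so the XPath has to select the attributes, as in "//meta/@content".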
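To pull out a single date rather than every meta tag, one option is to filter on the name attribute in the XPath and read the content attribute with xmlGetAttr. The following is a minimal sketch, not the answerer's code: the HTML string is a made-up fragment assembled from the tags quoted in the question, and the variable names are only illustrative.

```r
library(XML)

# Hypothetical page fragment built from the tags quoted in the question
html <- '<html><head>
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
</head><body>
<abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13</abbr>
</body></html>'

# asText = TRUE because we pass the HTML itself, not a URL or file name
doc <- htmlParse(html, asText = TRUE)

# xmlValue asks for element text, and <meta> has none to return
meta_text <- xpathSApply(doc, "//meta[@name='created']", xmlValue)

# xmlGetAttr reads one named attribute from each matched node
created <- xpathSApply(doc, "//meta[@name='created']",
                       xmlGetAttr, "content")
original <- xpathSApply(doc, "//meta[@name='OriginalPublicationDate']",
                        xmlGetAttr, "content")

# The same approach reads the machine-readable title attribute of <abbr>
abbr_title <- xpathSApply(doc, "//abbr[@class='published']",
                          xmlGetAttr, "title")

print(c(created = created, original = original, abbr = abbr_title))
```

Filtering with [@name='...'] keeps each value tied to the tag it came from, which may be safer than collecting all @name and @content values separately when some meta tags carry only one of the two attributes.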