Question

I have a bunch of webpages and I want to extract their publishing dates. For some webpages, the date is in an "abbr" tag (like: abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13), and I was able to get the date using:

doc = htmlParse(theURL, asText = T)
xpathSApply(doc, "//abbr", xmlValue)

But for other webpages, the date is in "meta" tags, such as:
meta name="created" content="2011-12-29T11:49:23+00:00"
meta name="OriginalPublicationDate" content="2012/11/14 10:56:58"

I tried xpathSApply(doc, "//meta", xmlValue), but it did not work.

So, what pattern should I use instead of "//meta"?

Thanks!

Answer 1

Taking this page as an example:

library(XML)
url <- "http://stackoverflow.com/questions/22342501/"
doc <- htmlParse(url, useInternalNodes=T)
names   <- doc["//meta/@name"]
content <- doc["//meta/@content"]
cbind(names,content)
#      names            content                                                                                                           
# [1,] "twitter:card"   "summary"                                                                                                         
# [2,] "twitter:domain" "stackoverflow.com"                                                                                               
# [3,] "og:type"        "website"                                                                                                         
# [4,] "og:image"       "http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6"                                  
# [5,] "og:title"       "how to get information within <meta name...> tag in html using htmlParse and xpathSApply"                        
# [6,] "og:description" "I have a bunch of webpages and I want to extract their publishing dates. \nFor some webpages, the da" [truncated]
# [7,] "og:url"         "http://stackoverflow.com/questions/22342501/how-to-get-information-within-meta-name-tag-in-html-usi" [truncated]

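The problem with

xpathSApply(doc, "//meta",xmlValue)

is that xmlValue(...) returns the element's content (that is, the text between its opening and closing tags). <meta> tags have no text; their data lives in the name and content attributes, so you have to select the attributes rather than the element value.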
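
To pull out just the dates from the question, one option is to filter on the name attribute in the XPath and read the content attribute with xmlGetAttr. This is only a minimal sketch: the inline HTML string below is a made-up stand-in for a real page, built from the two meta tags quoted in the question, and the variable names are illustrative.

```r
library(XML)

# Hypothetical page, assembled from the meta tags quoted in the question
html <- '<html><head>
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
</head><body><p>Some text</p></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Select each <meta> by its name attribute, then read its content attribute.
# Extra arguments to xpathSApply (here "content") are passed on to xmlGetAttr.
created <- xpathSApply(doc, "//meta[@name='created']",
                       xmlGetAttr, "content")
published <- xpathSApply(doc, "//meta[@name='OriginalPublicationDate']",
                         xmlGetAttr, "content")

print(created)
print(published)
```

If a page has no matching meta tag, the XPath matches nothing, so it is worth checking the length of the result before using it.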
