Web scraping dates and strings from a table into R

Asked: 2016-05-01 01:29:30

Tags: html r web-scraping

I need to scrape, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date whose Release column contains "Employment Situation". The scraped output should look like this:

Friday, January 09, 2015
Friday, February 06, 2015
Friday, March 06, 2015
Friday, April 03, 2015
Friday, May 08, 2015
Friday, June 05, 2015
Thursday, July 02, 2015
Friday, August 07, 2015
Friday, September 04, 2015
Friday, October 02, 2015
Friday, November 06, 2015
Friday, December 04, 2015

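To achieve this, I would like to repeat the following 12 times, once per month. Note that http://www.bls.gov/schedule/news_release/2015_sched.htm contains 12 tables, one per month, available as tbl[[2]], tbl[[3]], and so on.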

library(rvest)
url <- 'http://www.bls.gov/schedule/news_release/2015_sched.htm'
ses <- html_session(url)
tbl <- html_table(ses, fill = T) 
nfpdates <- tbl[[2]]$`Date`
nfpdates <- gsub('\\.', '', nfpdates)
nfpdates <- as.Date(nfpdates, 'weekdaystr(iD,:), %b %d, %Y')

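It does not work. The first problem is simple: I don't know how to refer to the day of the week in the format string; 'weekdaystr(iD,:)' is wrong. The second is more complicated: how do I extract only the dates whose "Release" entry contains "Employment Situation"?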

Any help is greatly appreciated. Thanks.

2 Answers:

Answer 0: (score: 3)

This is a perfect use case for XPath:

library(rvest)

pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")

# we need to target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")

# we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")

# clean up the cruft and make our dates!
as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")

##  [1] "2015-01-09" "2015-02-06" "2015-03-06" "2015-03-18" "2015-04-03"
##  [6] "2015-05-08" "2015-06-05" "2015-07-02" "2015-08-07" "2015-09-04"
## [11] "2015-10-02" "2015-11-06" "2015-12-04"
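
To see what the XPath expression selects without hitting the network, here is a minimal sketch that runs the same selector on a small inline HTML string. The markup and rows are made up for illustration and only assume that the real page has a similar row layout (date in the first cell, release title in a later cell).

```r
library(rvest)

# Toy HTML standing in for the BLS page; the rows and markup are invented for illustration
html <- '
<div id="bodytext">
  <table>
    <tr><td>Friday, January 09, 2015</td><td>08:30 AM</td><td>Employment Situation for December 2014</td></tr>
    <tr><td>Friday, January 16, 2015</td><td>08:30 AM</td><td>Consumer Price Index for December 2014</td></tr>
    <tr><td>Friday, February 06, 2015</td><td>08:30 AM</td><td>Employment Situation for January 2015</td></tr>
  </table>
</div>'

pg <- read_html(html)
body <- html_nodes(pg, "div#bodytext")

# Find each <td> whose text mentions the release, step up to its parent <tr>,
# then take the first <td> of that row (the date cell)
date_cells <- html_nodes(body, xpath = ".//td[contains(., 'Employment Situation for')]/../td[1]")

toy_dates <- trimws(html_text(date_cells))
print(toy_dates)
```

Because contains() does plain substring matching, every row whose release text includes the phrase is returned. That may be why the output above lists 13 dates (including 2015-03-18) rather than the 12 in the question, though the thread does not confirm this.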

Answer 1: (score: 2)

The first problem can be solved with the following format string:

nfpdates <- as.Date(nfpdates,"%A, %B %d, %Y")

You can then use the weekdays() function to find the day of the week.
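
As a small self-contained check of that format string, the sketch below parses two of the dates from the question and recovers the weekday. Setting LC_TIME to "C" is an assumption added here so that the English day and month names parse regardless of the system locale; it is not part of the original answer.

```r
# Force English day/month names for parsing and printing; "C" is an assumed locale choice
invisible(Sys.setlocale("LC_TIME", "C"))

raw <- c("Friday, January 09, 2015", "Thursday, July 02, 2015")

# %A = full weekday name, %B = full month name, %d = day of month, %Y = four-digit year
parsed <- as.Date(raw, format = "%A, %B %d, %Y")
print(parsed)

# weekdays() returns the weekday name of each parsed Date
print(weekdays(parsed))
```

Note that the question's code used %b, which matches abbreviated month names, whereas %B matches full names such as "January".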

As for the second problem, assuming you want the dates for which "Employment Situation" appears in the 'Release' column, this can be done as follows:

test <- tbl[[2]]$Date

test[grepl('Employment Situation',tbl[[2]]$Release)]
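
The snippet above handles one table only. To cover every month, as the question asks, the same filter could be applied to each table in the list. The following is a minimal sketch on made-up data: the toy tbl list, the helper name pick_dates, and the assumption that the first list element is not a monthly table (the question starts counting at tbl[[2]]) are all illustrative.

```r
# Toy stand-in for the list returned by html_table(); the column names Date and Release
# follow the question, the rows are invented for illustration
tbl <- list(
  data.frame(Date = "n/a", Release = "not a monthly table", stringsAsFactors = FALSE),
  data.frame(
    Date = c("Friday, January 09, 2015", "Friday, January 16, 2015"),
    Release = c("Employment Situation for December 2014",
                "Consumer Price Index for December 2014"),
    stringsAsFactors = FALSE
  ),
  data.frame(
    Date = c("Friday, February 06, 2015", "Thursday, February 26, 2015"),
    Release = c("Employment Situation for January 2015",
                "Consumer Price Index for January 2015"),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: keep the Date values whose Release text mentions the phrase
pick_dates <- function(df) {
  df$Date[grepl("Employment Situation", df$Release, fixed = TRUE)]
}

# Skip the first element (assumed not to be a monthly table) and combine the results
all_dates <- unlist(lapply(tbl[-1], pick_dates))
print(all_dates)
```

The resulting character vector can then be passed to as.Date() with the "%A, %B %d, %Y" format.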