R: HTML with redirect links, word search and counting

Asked: 2014-05-15 00:53:53

Tags: r web-scraping

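I am trying to use R to scrape a website and simplify a tedious online data collection process. The site I am currently interested in is: Wisconsin Bills- Author index

The site provides a redirect link for each legislator. Under each legislator there is a list of the bills they introduced, along with a link to a summary of the major actions on each bill. My end goal is a data frame with a column for the legislator's name, the number of Assembly bills introduced (only links containing "AB"), the number of bills that passed the Assembly, and the number of bills signed into law.

By scraping the site I have successfully created a data frame containing each legislator's first name, last name, district, state (always WI), and year (always 1999, t-1 being when the session ended). Here is my code: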

# load the packages that provide getURL() and the HTML/XPath functions
library(RCurl)
library(XML)

#specify the URL
url <- "https://docs.legis.wisconsin.gov/1997/related/author_index/assembly"

#download the HTML code
html <- getURL(url, ssl.verifypeer = FALSE, followlocation = TRUE)

#parse the HTML code
html.parsed <- htmlTreeParse(html, useInternalNodes = TRUE)

# Get list of legislator names:
names <- xpathSApply(html.parsed, path="//a[contains(@href, 'authorindex')]", xmlValue)

# get all links into a list:
links <- xpathSApply(html.parsed, "//a/@href")
# see what I have:
head(links) # still have hrefs in there
links <- as.vector(links) 
head(links) # good, hrefs are dropped.

# I only need the links that begin with /document/authorindex/1997. 
typeof(links) # confirming it's character
links # looking to see which ones to keep (only ones with "authorindex" and "A__", where the number that follows A is the district)
links <- links[14:114] # now the links only have the legislator redirects!!!

# Lets begin to build the final data frame needed:
# first, take a look at names- there are 104, but there are only 100 legislators...
names # elements 3-103 are leg names
names <- names[3:103]

# split up by first name, last name, etc.
names <- as.vector(names)
names1 <- strsplit(names, ",")
last.names <- sapply(names1, "[[", 1) # good- create a data frame 
id = c(1:101)
df <- data.frame(ID= id)
df$last.name = last.names # now have an ID and their last name.

# now need district, party, and first names.
# note: "p." is treated as a regular expression here, so the "." matches any character
first_names <- strsplit(names, "p.")
first_names # now republicans have 3 elements, dems have 2, first word of 2nd element is first name
# do another strsplit
first_names <- as.character(first_names)
first_names <- strsplit(first_names, " ()")
first_names # 4th element is almost always their name! do it that way, correct those that messed up by hand
first_names <- sapply(first_names, "[[", 4)  
first_names # 10 (Timothy), 90 (William) 80 (Joan H) 81 (Tom) 47 (John)
# 25 (Jose) 17 (Stephen) 5 (Spencer)
first_names[5] <- "Spencer"
first_names[10] <- "Timothy"
first_names[90] <- "William"
first_names[80] <- "Joan H."
first_names[81] <- "Tom"
first_names[47] <- "John"
first_names[25] <- "Jose"
first_names[17] <- "Stephen"
df$first.name <- first_names # first names- done.

# district:
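district <- regmatches(names, gregexpr("[[:digit:]]+", names))
# keep the first run of digits per name so the column is an integer vector, not a list
df$district <- as.integer(sapply(district, "[", 1))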
df$state <- "WI"
df$year <- 1999
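
The positional subsetting above (links[14:114]) could instead be written as a pattern match, so that it does not depend on where the links happen to sit on the page. This is only a minimal sketch in base R: the sample hrefs are made up, and the assumption that each legislator link ends in "A" followed by the district number is taken from the comment in the code, not checked against the live site.

```r
# Hypothetical hrefs standing in for the real `links` vector
links <- c(
  "/",
  "/1997/related/author_index",
  "/document/authorindex/1997/A01",
  "/document/authorindex/1997/A42",
  "/document/authorindex/1997/S07"
)

# Keep links that start with the author-index prefix and end in "A" + digits
is_assembly <- grepl("^/document/authorindex/1997", links) &
  grepl("A[0-9]+$", links)
assembly_links <- links[is_assembly]

# Pull the district number out of each kept link as an integer
district_from_link <- as.integer(sub(".*A([0-9]+)$", "\\1", assembly_links))

print(assembly_links)
print(district_from_link)
```

If the real hrefs follow a different pattern, only the two regular expressions would need to change.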

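Now I am stuck. I need to follow each redirect link, count the number of AB links under that legislator's name only, follow each AB link, and count, for each legislator, the number of AB pages that have "passed" in them and the number of AB pages that have the word "Sen." in them. So I would like to add the following columns to my existing df: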

Bills Introduced     Bills Passed Assembly    Bills Signed into Law
4                    3                        2
39                   18                       14

etc. I feel like I need to use a loop, but I am not sure how to approach it.
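
One way such a loop might be sketched is shown below. Everything here is an assumption for illustration: count_bills, fetch_stub, the column names, and the page texts are all hypothetical, and the fetching step is passed in as a function so the counting logic can be tried without touching the site.

```r
# Hypothetical helper: summarise one legislator's page.
#   links      - character vector of hrefs scraped from that legislator's page
#   fetch_text - function(url) returning the page contents as a single string
count_bills <- function(links, fetch_text) {
  # Keep each Assembly bill link once; fixed = TRUE matches "AB" literally
  ab_links <- unique(links[grepl("AB", links, fixed = TRUE)])
  # Download (or look up) the action summary behind every AB link
  pages <- vapply(ab_links, fetch_text, character(1), USE.NAMES = FALSE)
  data.frame(
    bills.introduced  = length(ab_links),
    pages.with.passed = sum(grepl("passed", pages, fixed = TRUE)),
    # fixed = TRUE matters here: as a regex, "Sen." would also match "Sent"
    pages.with.sen    = sum(grepl("Sen.", pages, fixed = TRUE))
  )
}

# Made-up page texts standing in for the real bill histories
fake_pages <- c(
  "/hypothetical/AB1" = "Read a third time and passed. Received from Sen. desk.",
  "/hypothetical/AB2" = "Referred to committee. Sent to clerk.",
  "/hypothetical/AB3" = "Read a third time and passed."
)
fetch_stub <- function(url) fake_pages[[url]]

# One element per legislator, in the same order as the rows of df
legislator_links <- list(
  c("/hypothetical/AB1", "/hypothetical/AB2", "/hypothetical/SB9"),
  c("/hypothetical/AB3")
)

# Loop over legislators and stack the one-row results
counts <- do.call(
  rbind,
  lapply(legislator_links, count_bills, fetch_text = fetch_stub)
)
print(counts)
```

With the real site, fetch_text could presumably wrap the same getURL call used earlier, and cbind(df, counts) would attach the three columns, provided the rows line up with the legislators in df. Whether "passed" and "Sen." are reliable markers for the two outcomes would still need checking against a few bill pages by hand.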

Any help would be greatly appreciated.

Thanks!

0 answers:

No answers yet