R - 当缺少标签时用rvest刮取HTML表

时间:2015-06-22 20:48:30

标签: html r html-table rvest

我正在尝试使用rvest从网站上抓取HTML表格。唯一的问题是我正在尝试刮除的表没有SELECT标签,除了第一行。它看起来像这样:

<tr>

等等。因此,使用以下代码,我只能将第一行放入我的数据框中:

<tr> 
  <td>6/21/2015 9:38 PM</td>
  <td>5311 Lake Park</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was causing a disturbance in the area.</td>
  <td>Name checked; no further action</td>
  <td>No</td>
</tr>

  <td>6/21/2015 10:37 PM</td>
  <td>5200 S Blackstone</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was observed fighting in the McDonald's parking lot</td>
  <td>Warned; released</td>
  <td>No</td>
</tr>

如何改变这一点以使html_table理解行是行,即使它们没有开放library(rvest) mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>% html_node("table") %>% html_table(header = TRUE, fill=TRUE) 标记?或者有更好的方法来解决这个问题吗?

3 个答案:

答案 0 :(得分:2)

library(rvest)

url_parse<- read_html("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") 

col_name<- url_parse %>%
  html_nodes("th") %>%
  html_text()

mydata <- url_parse %>%
  html_nodes("td") %>%
  html_text()

finaldata <- data.frame(matrix(mydata, ncol=7, byrow=TRUE))

names(finaldata) <- col_name

finaldata

                     Incident                                  Location    

    Reported                              Occurred
1                           Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM 5/31/15 to 6/1/15 8:00 PM to 12:00 PM
2                     Information                          5835 S. Kimbark   6/1/15 3:57 PM                        6/1/15 3:55 PM
3                     Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM                        6/2/15 2:18 AM
4 Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM                        6/2/15 8:00 AM
5     Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM             6/2/15 6:45 PM to 7:30 PM
                                                                                                                   Comments / Nature of Fire Disposition
1                                                                                       Bicycle secured to bike rack taken by unknown person        Open
2             Unknown person used staff member's personal information to file a fraudulent claim with U.S. Social Security Admin. / CPD case         CPD
3 Three unaffiliated individuals reported tampering with bicycles in bike rack / Subjects were given trespass warnings and sent on their way      Closed
4                                                                      Rear wiper blade assembly damaged on UC owned vehicle during car wash      Closed
5                                                           Unknown person(s) spray painted graffiti on north concrete wall of the structure        Open
  UCPDI#
1 E00344
2 E00345
3 E00346
4 E00347
5 E00348

答案 1 :(得分:2)

与@ user227710略有不同,但大致相同。类似地,这也利用了double s的数量是均匀的这一事实。

,这也会将所有事件(TD每个页面抓取到一个rbind数据框中)。

incidents只会给你进度条,因为这需要几秒钟。除非在交互式会话中完全没有必要。

pblapply

答案 2 :(得分:0)

谢谢大家!我最终从其他R用户那里得到了一些帮助,他们提出了以下解决方案。它需要html,保存它,添加<tr>(很像@Bram Vanroy建议的),并将其转换回html对象,然后可以将其刮入数据帧。

library(rvest)
myurl <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
download.file(myurl, destfile="myfile.html", method="curl")
myhtml <- readChar("myfile.html", file.info("myfile.html")$size)
myhtml <- gsub("</tr>", "</tr><tr>", myhtml, fixed = TRUE)
mydata <- html(myhtml)

mydf <- mydata %>%
  html_node("table") %>%
  html_table(fill = TRUE)

mydf <- na.omit(mydf)

最后一行是省略用这种方法显示的一些奇怪的NA行。