Scraping data from a GroupGrid

Time: 2016-02-27 18:36:17

Tags: r web-scraping rvest

I'd like to scrape and analyze the input to a when2meet form.

Here's an example: http://www.when2meet.com/?4474391-IBuBA

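The grid gives a quick visual overview of each group member's availability; I'd like to extract it into R for some analysis, but I've come up short.

Very short, in fact; all I can extract is the main page element. The output is (to me) gibberish: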

library(rvest)

url <- "http://www.when2meet.com/?4474391-IBuBA"

grid <- read_html(url) %>% html_nodes(xpath = '//*[@id="GroupGrid"]')

grid looks like this:

<div style="font-size:0px;vertical-align:top;"><div id="GroupTime279816300" onmouseover="ShowSlot(279816300);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #c5e2b6;"><script><![CDATA[
Col[TimeOfSlot.indexOf(279816300)] = 0;
Row[TimeOfSlot.indexOf(279816300)] = 23;
]]></script></div>
<div id="GroupTime279902700" onmouseover="ShowSlot(279902700);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #8ac56d;"><script><![CDATA[
Col[TimeOfSlot.indexOf(279902700)] = 1;
Row[TimeOfSlot.indexOf(279902700)] = 23;
]]></script></div>
<div id="GroupTime279989100" onmouseover="ShowSlot(279989100);" style="vertical-align:top;display:inline-block;*display:inline;zoom:1;width:44px;height:9px;font-size:0px;border-left: 1px black solid;background: #c5e2b6;"><script><![CDATA[
Col[TimeOfSlot.indexOf(279989100)] = 2;
Row[TimeOfSlot.indexOf(279989100)] = 23;
]]></script>

I see basically nothing of use here; it may as well be Urdu. I couldn't find anything on Google or SO about scraping a GroupGrid table.

Does anyone know how to proceed?

Ideally, I'd end up with an output data.table (or data.frame, if you must):

output
#        id    slot available
# 1: user_1  M 9:00      TRUE
# 2: user_1  T 9:30     FALSE
# 3: user_1 W 10:00      TRUE
# 4: user_1 R 10:30      TRUE
# 5: user_2  M 9:00      TRUE
# 6: user_2  T 9:30     FALSE
# 7: user_2 W 10:00      TRUE
# 8: user_2 R 10:30     FALSE

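The exact format of the slot column doesn't matter, and it doesn't need to be a single column; it could be split into day and time if that's easier.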

1 answer:

Answer 0: (score: 2)

You could do something like this:

library(rvest)
library(data.table)

# read_html() replaces the deprecated html()
script <- read_html("http://www.when2meet.com/?4474391-IBuBA") %>%
  html_nodes("script:contains('PeopleNames')") %>% html_text()

f <- function(regex) {
  m <- regmatches(script, gregexpr(regex, script))[[1]]
  #faster than transposing with `t`
  setDT(transpose(lapply(regmatches(m, regexec(regex, m)), "[", -1)))[]
}
slots <- f("TimeOfSlot\\[(\\d+)\\]=(\\d+);")
users <- f( "PeopleNames\\[(\\d+)\\] = '([^']+)';PeopleIDs\\[\\d+\\] = (\\d+);")
avails <- f("AvailableAtSlot\\[(\\d+)]\\.push\\((\\d+)\\);")

DT <- melt(dcast(avails, V2~V1, 
                 fun.aggregate = function(x) length(x) > 0,
                 value.var = "V2"), id.vars = "V2",
           variable.name = "timeslot", value.name = "available")

DT[users, id := i.V2, on = c(V2 = "V3")]
DT[slots, time := format(as.POSIXct(as.integer(
  i.V2), origin = "1970-01-01", tz = "GMT"), "%a %H:%M"),
  on = c(timeslot = "V1")]

DT[ , c("V2", "timeslot") := NULL]

DT[time == "Mon 11:00" & available]
#    available      id      time
# 1:      TRUE  user_1 Mon 11:00
# 2:      TRUE  user_2 Mon 11:00
# 3:      TRUE  user_3 Mon 11:00
# 4:      TRUE  user_4 Mon 11:00
# 5:      TRUE  user_5 Mon 11:00
# 6:      TRUE  user_7 Mon 11:00
# 7:      TRUE user_10 Mon 11:00

DT[time == "Mon 11:00" & !available]
#    available     id      time
# 1:     FALSE user_6 Mon 11:00
# 2:     FALSE user_8 Mon 11:00
# 3:     FALSE user_9 Mon 11:00
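
The core of the answer is the helper f, which pulls every capture group out of the page's inline JavaScript. The sketch below isolates that technique in base R so it can be tried without a network connection. The script string, the second timestamp, the person IDs 101 and 102, and the name extract_groups are all made up for illustration; only the statement shapes and the timestamp 279816300 come from the page above.

```r
# Hypothetical stand-in for the page's inline script. The real page is
# assumed to contain many statements of these two shapes.
script <- paste0(
  "TimeOfSlot[0]=279816300;TimeOfSlot[1]=279817200;",
  "AvailableAtSlot[0].push(101);AvailableAtSlot[1].push(101);",
  "AvailableAtSlot[1].push(102);"
)

# Return one row per match and one column per capture group.
# (extract_groups is an illustrative name, not part of any package.)
extract_groups <- function(regex, text) {
  # gregexpr() locates every full match in the text ...
  hits <- regmatches(text, gregexpr(regex, text))[[1]]
  # ... and regexec() re-matches each hit to expose its capture groups.
  # Element 1 of each result is the full match, so it is dropped.
  groups <- lapply(regmatches(hits, regexec(regex, hits)), `[`, -1)
  as.data.frame(do.call(rbind, groups), stringsAsFactors = FALSE)
}

# V1 = slot index, V2 = epoch seconds
slots <- extract_groups("TimeOfSlot\\[(\\d+)\\]=(\\d+);", script)
# V1 = slot index, V2 = person ID
avails <- extract_groups("AvailableAtSlot\\[(\\d+)\\]\\.push\\((\\d+)\\);",
                         script)

# Convert epoch seconds to a weekday and time label, as the answer does.
# Note that "%a" is locale-dependent.
slots$time <- format(
  as.POSIXct(as.integer(slots$V2), origin = "1970-01-01", tz = "GMT"),
  "%a %H:%M"
)

print(slots)
print(avails)
```

Each AvailableAtSlot row records that one person is free in one slot, so a missing (slot, person) pair means "not available". That is why the answer casts with length(x) > 0 as the aggregate before melting back to long form.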
