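R - How to extract values from a JavaScript object in downloaded HTML code

Date: 2015-11-10 03:22:25

Tags: javascript html r

I am using the rvest package for web scraping, but I am having trouble extracting information from a JavaScript object.

The JavaScript has this form: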

... some js ...
var selectoptions = {
  "Region A": {
  "key" : "a",
  "defaultvalue" : "a",
  "values" : { //key : value
                    "(A01) A1": "a01",
                    "(A02) A2": "a02",
                    "(A03) A3": "a03",
                    "(A04) A4": "a04"
  }
 }, 
  "Region B": {
  "key" : "b",
  "defaultvalue" : "b",
  "values" : { //key : value
                    "(B01) B1": "b01",
                    "(B02) B2": "b02",
                    "(B03) B3": "b03",
                    "(B04) B4": "b04"
  }
 }
}
... some js ...

How can I extract the information (the "values" for each region)?

Here is my attempt:

library(rvest)
library(stringr)
url <- "http://www.census2011.gov.hk/en/constituency-area.html" #the url
js_code <- html(url) %>% html_nodes("script") %>% html_text()
js_code <- js_code[[9]] # The information I wanted is in the 9th element
info_wanted1 <- str_extract(js_code, "\\{.*?\\}")
info_wanted2 <- str_extract_all(js_code, "\\{.*?\\}")

> info_wanted1
[1] NA
> info_wanted2
[[1]]
character(0)

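But it returns nothing. I expected it to at least give me the content enclosed in { }. Did I make a mistake somewhere? Any suggestions?

Thanks!

1 Answer:

Answer 0 (score: 3)

This is the cleanest regex parsing I was able to construct: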

js2 <- strsplit(js_code, "values")  # first split on "values"
# Then remove the first item, which preceded the first instance, and work on the rest.

# In each remaining piece, match from "{" up to the first "}" that follows it.
js3 <- lapply( js2[[1]][-1], function(tx) {
                                  regmatches(tx, gregexpr("\\{[^}]+\\}", tx) ) })

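The [^}]+\\} part of the pattern uses a negated character class. It basically says: return every character that is not a closing curly brace, up to the first closing curly brace that follows the text "values".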
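To make the split-then-match idea concrete, here is a minimal sketch run on a shortened copy of the snippet from the question. The variable txt is a hypothetical stand-in for js_code, and the names pieces and blocks are illustrative only. It assumes no label or code contains a closing curly brace.

```r
# Hypothetical stand-in for js_code, modelled on the snippet in the question.
txt <- paste(
  'var selectoptions = {',
  '  "Region A": {',
  '  "key" : "a",',
  '  "defaultvalue" : "a",',
  '  "values" : { //key : value',
  '      "(A01) A1": "a01",',
  '      "(A02) A2": "a02"',
  '  }',
  ' },',
  '  "Region B": {',
  '  "key" : "b",',
  '  "defaultvalue" : "b",',
  '  "values" : { //key : value',
  '      "(B01) B1": "b01",',
  '      "(B02) B2": "b02"',
  '  }',
  ' }',
  '}',
  sep = "\n")

# Split on the literal text "values"; the first piece precedes the first
# occurrence and is dropped.
pieces <- strsplit(txt, "values", fixed = TRUE)[[1]][-1]

# In each piece, take the first "{ ... }" run with no closing brace inside it.
blocks <- vapply(
  pieces,
  function(tx) regmatches(tx, regexpr("\\{[^}]+\\}", tx)),
  character(1),
  USE.NAMES = FALSE)

cat(blocks, sep = "\n---\n")
```

Each element of blocks should hold the "values" block of one region. The approach relies on "values" appearing exactly once per region and on the block itself containing no nested braces.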

----

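Earlier exploration:

First, I assigned that text to a variable named txt, without going through the read operations, which might get tripped up by the line breaks.

Your pattern does not match in that text: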

> regmatches(txt, gregexpr("\\{.?\\n\\}", txt) )
[[1]]
character(0)

But a slight modification succeeds:

> regmatches(txt, gregexpr("\\{.+\\n\\}", txt) )
[[1]]
[1] "{\n  \"Region A\": {\n  \"key\" : \"a\",\n  \"defaultvalue\" : \"a\",\n  \"values\" : { //key : value\n                    \"(A01) A1\": \"a01\",\n                    \"(A02) A2\": \"a02\",\n                    \"(A03) A3\": \"a03\",\n                    \"(A04) A4\": \"a04\"\n  }\n }, \n  \"Region B\": {\n  \"key\" : \"b\",\n  \"defaultvalue\" : \"b\",\n  \"values\" : { //key : value\n                    \"(B01) B1\": \"b01\",\n                    \"(B02) B2\": \"b02\",\n                    \"(B03) B3\": \"b03\",\n                    \"(B04) B4\": \"b04\"\n  }\n }\n}"

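Because regular expressions are usually "greedy", the algorithm finds the first match and then extends it as far as possible, all the way to the last curly brace.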
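The greedy behaviour can be seen on a tiny input. The sketch below uses a made-up two-block string x (not the real page) and compares a greedy pattern, a lazy quantifier in Perl-compatible mode, and a negated character class. In Perl-compatible mode the dot does not match a newline unless the (?s) flag is set; stringr's engine behaves the same way by default (its regex() helper has a dotall argument), which is probably why the original str_extract attempt returned NA.

```r
# Made-up input: two brace-delimited blocks, each spanning several lines.
x <- "{\n a: 1\n}\n{\n b: 2\n}"

# Greedy: with R's default regex engine the dot also matches "\n",
# so a single match runs from the first "{" to the last "}".
greedy <- regmatches(x, gregexpr("\\{.+\\}", x))[[1]]

# Lazy quantifier with perl = TRUE but without (?s): the dot stops at "\n",
# so nothing matches at all.
no_dotall <- regmatches(x, gregexpr("\\{.+?\\}", x, perl = TRUE))[[1]]

# Lazy quantifier with (?s): each block is matched separately.
lazy <- regmatches(x, gregexpr("(?s)\\{.+?\\}", x, perl = TRUE))[[1]]

# Negated character class: no brace of either kind inside the match.
negated <- regmatches(x, gregexpr("\\{[^{}]+\\}", x))[[1]]

print(length(greedy))
print(length(no_dotall))
print(lazy)
print(negated)
```

On nested input such as the selectoptions object, a lazy match would still start at the outermost opening brace, whereas the [^{}] class only matches the innermost blocks, so the two are not interchangeable there.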

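To defeat the greediness of the regex, you first need to split the text on a suitable delimiter into separate character-vector elements, and then pick out the strings. The delimiter used here is "values":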

> js2 <- strsplit(js_code,  "values")
> js3 <- lapply( js2[[1]], function(tx) {regmatches(tx, gregexpr("\\{.+\\}", tx) ) })
> js3[[1]]
[[1]]
[1] "{\r\n\t\t //create a bubble popup for each DOM element with class attribute as \"text\", \"button\" or \"link\" and LI, P, IMG elements.\r\n\t\t $('.link-1').CreateBubblePopup({\r\n  position : 'top',\r\n  align : 'center',\r\n  innerHtml: 'Terms and Definitions',\r\n  innerHtmlStyle: {\r\n\t\t\t  color:'#FFFFFF', \r\n\t\t\t  'text-align':'center',\r\n\t\t\t  'padding':'5px'\r\n\t\t\t },\r\n  themeName: 'all-black',\r\n  themePath: 'images/jquerybubblepopup-theme'\r\n });\r\n\t\t $('.link-2').CreateBubblePopup({\r\n  position : 'top',\r\n  align\t : 'center',\r\n  innerHtml: 'Data Dissemination Events',\r\n  innerHtmlStyle: {\r\n   color:'#FFFFFF', \r\n   'text-align':'center',\r\n   'padding':'5px'\r\n  },\r\n  themeName: \t'all-black',\r\n  themePath: \t'images/jquerybubblepopup-theme'\r\n });\r\n $('.link-3').CreateBubblePopup({\r\n  position : 'top',\r\n  align\t : 'center',\r\n  innerHtml: 'Download 2011 District Council Electoral Boundaries Index Map',\r\n  innerHtmlStyle: {\r\n   color:'#FFFFFF', \r\n   'text-align':'center',\r\n   'padding':'5px'\r\n  },\r\n  themeName: \t'all-black',\r\n\t\t\t\tthemePath: \t'images/jquerybubblepopup-theme'\r\n });\r\n  });\r\n  $(document).ready(function(){\r\n\t  var options = {\r\n\t\t\t\tpreselectFirst : \"hki\",\r\n\t\t\t\tpreselectSecond : \"a01\",\r\n\t\t\t\temptyOption: false,\r\n\t\t\t\temptyValue: 'Please Select',\r\n\t\t\t\temptyKey: '-'\r\n }"

> js3[[2]]
[[1]]
[1] "{ //key : value\r\n\t\t\t\t\t\"(A01) Chung Wan\": \"a01\",\r\n\t\t\t\t\t\"(A02) Mid Levels East\": \"a02\",\r\n\t\t\t\t\t\"(A03) Castle Road\": \"a03\",\r\n\t\t\t\t\t\"(A04) Peak\": \"a04\",\r\n\t\t\t\t\t\"(A05) University\": \"a05\",\r\n\t\t\t\t\t\"(A06) Kennedy Town & Mount Davis\": \"a06\",\r\n\t\t\t\t\t\"(A07) Kwun Lung\": \"a07\",\r\n\t\t\t\t\t\"(A08) Sai Wan\": \"a08\",\r\n\t\t\t\t\t\"(A09) Belcher\": \"a09\",\r\n\t\t\t\t\t\"(A10) Shek Tong Tsui\": \"a10\",\r\n\t\t\t\t\t\"(A11) Sai Ying Pun\": \"a11\",\r\n\t\t\t\t\t\"(A12) Sheung Wan\": \"a12\",\r\n\t\t\t\t\t\"(A13) Tung Wah\": \"a13\",\r\n\t\t\t\t\t\"(A14) Centre Street\": \"a14\",\r\n\t\t\t\t\t\"(A15) Water Street\": \"a15\",\r\n\t\t\t\t\t\"(B01) Hennessy\": \"b01\",\r\n\t\t\t\t\t\"(B02) Oi Kwan\": \"b02\",\r\n\t\t\t\t\t\"(B03) Canal Road\": \"b03\",\r\n\t\t\t\t\t\"(B04) Causeway Bay\": \"b04\",\r\n\t\t\t\t\t\"(B05) Tai Hang\": \"b05\",\r\n\t\t\t\t\t\"(B06) Jardine's Lookout\": \"b06\",\r\n\t\t\t\t\t\"(B07) Broadwood\": \"b07\",\r\n\t\t\t\t\t\"(B08) Happy Valley\": \"b08\",\r\n\t\t\t\t\t\"(B09) Stubbs Road\": \"b09\",\r\n\t\t\t\t\t\"(B10) Southorn\": \"b10\",\r\n\t\t\t\t\t\"(B11) Tai Fat Hau\": \"b11\",\r\n\t\t\t\t\t\"(C01) Tai Koo Shing West\": \"c01\",\r\n\t\t\t\t\t\"(C02) Tai Koo Shing East\": \"c02\",\r\n\t\t\t\t\t\"(C03) Lei King Wan\": \"c03\",\r\n\t\t\t\t\t\"(C04) Aldrich Bay\": \"c04\",\r\n\t\t\t\t\t\"(C05) Shaukeiwan\": \"c05\",\r\n\t\t\t\t\t\"(C06) A Kung Ngam\": \"c06\",\r\n\t\t\t\t\t\"(C07) Heng Fa Chuen\": \"c07\",\r\n\t\t\t\t\t\"(C08) Tsui Wan\": \"c08\",\r\n\t\t\t\t\t\"(C09) Yan Lam\": \"c09\",\r\n\t\t\t\t\t\"(C10) Siu Sai Wan\": \"c10\",\r\n\t\t\t\t\t\"(C11) King Yee\": \"c11\",\r\n\t\t\t\t\t\"(C12) Wan Tsui\": \"c12\",\r\n\t\t\t\t\t\"(C13) Fei Tsui\": \"c13\",\r\n\t\t\t\t\t\"(C14) Mount Parker\": \"c14\",\r\n\t\t\t\t\t\"(C15) Braemar Hill\": \"c15\",\r\n\t\t\t\t\t\"(C16) Tin Hau\": \"c16\",\r\n\t\t\t\t\t\"(C17) Fortress Hill\": \"c17\",\r\n\t\t\t\t\t\"(C18) Victoria Park\": 
\"c18\",\r\n\t\t\t\t\t\"(C19) City Garden\": \"c19\",\r\n\t\t\t\t\t\"(C20) Provident\": \"c20\",\r\n\t\t\t\t\t\"(C21) Fort Street\": \"c21\",\r\n\t\t\t\t\t\"(C22) Kam Ping\": \"c22\",\r\n\t\t\t\t\t\"(C23) Tanner\": \"c23\",\r\n\t\t\t\t\t\"(C24) Healthy Village\": \"c24\",\r\n\t\t\t\t\t\"(C25) Quarry Bay\": \"c25\",\r\n\t\t\t\t\t\"(C26) Nam Fung\": \"c26\",\r\n\t\t\t\t\t\"(C27) Kornhill\": \"c27\",\r\n\t\t\t\t\t\"(C28) Kornhill Garden\": \"c28\",\r\n\t\t\t\t\t\"(C29) Hing Tung\": \"c29\",\r\n\t\t\t\t\t\"(C30) Sai Wan Ho\": \"c30\",\r\n\t\t\t\t\t\"(C31) Lower Yiu Tung\": \"c31\",\r\n\t\t\t\t\t\"(C32) Upper Yiu Tung\": \"c32\",\r\n\t\t\t\t\t\"(C33) Hing Man\": \"c33\",\r\n\t\t\t\t\t\"(C34) Lok Hong\": \"c34\",\r\n\t\t\t\t\t\"(C35) Tsui Tak\": \"c35\",\r\n\t\t\t\t\t\"(C36) Yue Wan\": \"c36\",\r\n\t\t\t\t\t\"(C37) Kai Hiu\": \"c37\",\r\n\t\t\t\t\t\"(D01) Aberdeen\": \"d01\",\r\n\t\t\t\t\t\"(D02) Ap Lei Chau Estate\": \"d02\",\r\n\t\t\t\t\t\"(D03) Ap Lei Chau North\": \"d03\",\r\n\t\t\t\t\t\"(D04) Lei Tung I\": \"d04\",\r\n\t\t\t\t\t\"(D05) Lei Tung II\": \"d05\",\r\n\t\t\t\t\t\"(D06) South Horizons East\": \"d06\",\r\n\t\t\t\t\t\"(D07) South Horizons West\": \"d07\",\r\n\t\t\t\t\t\"(D08) Wah Kwai\": \"d08\",\r\n\t\t\t\t\t\"(D09) Wah Fu I\": \"d09\",\r\n\t\t\t\t\t\"(D10) Wah Fu II\": \"d10\",\r\n\t\t\t\t\t\"(D11) Pokfulam\": \"d11\",\r\n\t\t\t\t\t\"(D12) Chi Fu\": \"d12\",\r\n\t\t\t\t\t\"(D13) Tin Wan\": \"d13\",\r\n\t\t\t\t\t\"(D14) Shek Yue\": \"d14\",\r\n\t\t\t\t\t\"(D15) Wong Chuk Hang\": \"d15\",\r\n\t\t\t\t\t\"(D16) Bays Area\": \"d16\",\r\n\t\t\t\t\t\"(D17) Stanley & Shek O\": \"d17\"\r\n  }\r\n }"

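You then need to "clean up" by trimming the unneeded material from the leading and trailing parts of these chunks. As it turns out, that means at least removing the first item, which does not look like the table you want.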
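One possible way to do that clean-up, sketched under assumptions: chunk below is a hypothetical, shortened element shaped like the js3 output above, and pair_pattern, labels, codes and values are illustrative names. Rather than trimming the ends, it pulls out every "label": "code" pair directly, so the comment, tabs and line breaks are never matched. It assumes neither labels nor codes contain a double quote.

```r
# Hypothetical chunk, shaped like one element of the js3 output above.
chunk <- "{ //key : value\r\n\t\"(A01) Chung Wan\": \"a01\",\r\n\t\"(A02) Mid Levels East\": \"a02\"\r\n  }"

# One quoted label, a colon, one quoted code; the parentheses capture both parts.
pair_pattern <- '"([^"]+)"[[:space:]]*:[[:space:]]*"([^"]+)"'

# All label/code pairs found in the chunk.
pairs <- regmatches(chunk, gregexpr(pair_pattern, chunk))[[1]]

# Pull the two capture groups out of each pair.
labels <- sub(pair_pattern, "\\1", pairs)
codes  <- sub(pair_pattern, "\\2", pairs)

# Named character vector: names are the labels, values are the codes.
values <- setNames(codes, labels)
print(values)
```

Applying the same steps to each extracted block (for example with lapply) would give one named vector per block.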