I am using the rvest package for web scraping, but I am having trouble extracting information from a JavaScript object.
The JavaScript has this form:
... some js ...
var selectoptions = {
"Region A": {
"key" : "a",
"defaultvalue" : "a",
"values" : { //key : value
"(A01) A1": "a01",
"(A02) A2": "a02",
"(A03) A3": "a03",
"(A04) A4": "a04"
}
},
"Region B": {
"key" : "b",
"defaultvalue" : "b",
"values" : { //key : value
"(B01) B1": "b01",
"(B02) B2": "b02",
"(B03) B3": "b03",
"(B04) B4": "b04"
}
}
}
... some js ...
How can I extract the information (the "values" for each region)?
Here is my attempt:
library(rvest)
library(stringr)
url <- "http://www.census2011.gov.hk/en/constituency-area.html" #the url
js_code <- read_html(url) %>% html_nodes("script") %>% html_text() # read_html() replaces the deprecated html()
js_code <- js_code[[9]] # The information I wanted is in the 9th element
info_wanted1 <- str_extract(js_code, "\\{.*?\\}")
info_wanted2 <- str_extract_all(js_code, "\\{.*?\\}")
> info_wanted1
[1] NA
> info_wanted2
[[1]]
character(0)
But it returns nothing. I expected it to give me at least the content enclosed in { }. Have I made a mistake? Any suggestions?
Thanks!
Answer 0 (score: 3)
This is the cleanest regex-based parse I was able to construct:
js2 <- strsplit(js_code, "values") # first split on "values"
# Then remove the first item, which preceded the first instance, and work on the rest.
js3 <- lapply( js2[[1]][-1], function(tx) {
# from the word "value" in the trailing comment up to the first closing brace
regmatches(tx, gregexpr("value[^}]+\\}", tx) ) })
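The [^}]+\\} part of that pattern is a negated character class followed by an escaped closing brace. It essentially says: after the text value, return every character that is not a closing curly brace, up to and including the first closing curly brace.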
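The split-then-match idea can be tried offline on a small stand-in string. This is only a sketch: txt below is a hypothetical, shortened copy of the snippet from the question, not the real page, and the names pieces and blocks are made up for illustration.

```r
# Hypothetical stand-in for js_code, shaped like the snippet in the question.
txt <- paste(
  'var selectoptions = {',
  '"Region A": {',
  '"key" : "a",',
  '"defaultvalue" : "a",',
  '"values" : { //key : value',
  '"(A01) A1": "a01",',
  '"(A02) A2": "a02"',
  '}',
  '},',
  '"Region B": {',
  '"key" : "b",',
  '"defaultvalue" : "b",',
  '"values" : { //key : value',
  '"(B01) B1": "b01",',
  '"(B02) B2": "b02"',
  '}',
  '}',
  '}',
  sep = "\n")

# Split on the literal string "values" and drop the piece before its first occurrence.
pieces <- strsplit(txt, "values", fixed = TRUE)[[1]][-1]

# In each piece, take the text from "value" (in the trailing comment)
# through the first closing brace; [^}] also matches newlines here.
blocks <- vapply(pieces, function(tx) {
  regmatches(tx, regexpr("value[^}]+\\}", tx))
}, character(1), USE.NAMES = FALSE)

print(blocks)
```

Each element of blocks should hold one region's key/value lines. The real page may need a different anchor word if its comments differ from the sample.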
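Earlier exploration:
First, I assigned that text to a variable named txt, without going through a read operation that line breaks might have thrown off.
Your pattern does not match in that text: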
> regmatches(txt, gregexpr("\\{.?\\n\\}", txt) )
[[1]]
character(0)
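As an aside on the stringr call in the question: a likely explanation for the NA (an assumption, not something stated in this answer) is that stringr uses ICU regular expressions, where . does not match line terminators by default, so \\{.*?\\} can only match braces that sit on the same line. Base R with perl = TRUE (PCRE) has the same default, so the effect can be sketched without stringr. The string x is a made-up example.

```r
# Made-up multi-line object, similar in shape to the JavaScript in the question.
x <- "{\n\"k\": \"v\"\n}"

# With PCRE defaults, "." stops at the newline, so nothing matches.
no_dotall <- regmatches(x, regexpr("\\{.*?\\}", x, perl = TRUE))

# The inline flag (?s) lets "." match newlines as well.
dotall <- regmatches(x, regexpr("(?s)\\{.*?\\}", x, perl = TRUE))

print(length(no_dotall))
print(dotall)
```

If this is indeed the cause, passing regex(pattern, dotall = TRUE) to the stringr functions should have a comparable effect, although the lazy match would still stop at the first closing brace of a nested object.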
But a slight modification works:
> regmatches(txt, gregexpr("\\{.+\\n\\}", txt) )
[[1]]
[1] "{\n \"Region A\": {\n \"key\" : \"a\",\n \"defaultvalue\" : \"a\",\n \"values\" : { //key : value\n \"(A01) A1\": \"a01\",\n \"(A02) A2\": \"a02\",\n \"(A03) A3\": \"a03\",\n \"(A04) A4\": \"a04\"\n }\n }, \n \"Region B\": {\n \"key\" : \"b\",\n \"defaultvalue\" : \"b\",\n \"values\" : { //key : value\n \"(B01) B1\": \"b01\",\n \"(B02) B2\": \"b02\",\n \"(B03) B3\": \"b03\",\n \"(B04) B4\": \"b04\"\n }\n }\n}"
Since regular expressions are usually "greedy", the algorithm finds the first match and then extends it as far as it can, all the way to the last curly brace.
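The greedy behaviour is easy to reproduce on a toy string. In this sketch x is a hypothetical nested object. The second pattern uses a negated character class, a different technique from the splitting used in this answer, and it only finds the innermost brace pairs.

```r
# Hypothetical nested object with two inner "values" blocks.
x <- '{\n"A": {\n"values": {"k1": "v1"}\n},\n"B": {\n"values": {"k2": "v2"}\n}\n}'

# Greedy: ".+" runs to the end and backtracks to the last "}",
# so the single match is the whole string.
greedy <- regmatches(x, gregexpr("\\{.+\\}", x))[[1]]

# Negated class: a "{" followed only by non-brace characters and a "}",
# which picks out the innermost blocks.
inner <- regmatches(x, gregexpr("\\{[^{}]+\\}", x))[[1]]

print(greedy)
print(inner)
```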
To defeat the greediness of the regex, first split the text into separate character-vector elements at a suitable delimiter, here the string "values":
> js2 <- strsplit(js_code, "values")
> js3 <- lapply( js2[[1]], function(tx) {regmatches(tx, gregexpr("\\{.+\\}", tx) ) })
> js3[[1]]
[[1]]
[1] "{\r\n\t\t //create a bubble popup for each DOM element with class attribute as \"text\", \"button\" or \"link\" and LI, P, IMG elements.\r\n\t\t $('.link-1').CreateBubblePopup({\r\n position : 'top',\r\n align : 'center',\r\n innerHtml: 'Terms and Definitions',\r\n innerHtmlStyle: {\r\n\t\t\t color:'#FFFFFF', \r\n\t\t\t 'text-align':'center',\r\n\t\t\t 'padding':'5px'\r\n\t\t\t },\r\n themeName: 'all-black',\r\n themePath: 'images/jquerybubblepopup-theme'\r\n });\r\n\t\t $('.link-2').CreateBubblePopup({\r\n position : 'top',\r\n align\t : 'center',\r\n innerHtml: 'Data Dissemination Events',\r\n innerHtmlStyle: {\r\n color:'#FFFFFF', \r\n 'text-align':'center',\r\n 'padding':'5px'\r\n },\r\n themeName: \t'all-black',\r\n themePath: \t'images/jquerybubblepopup-theme'\r\n });\r\n $('.link-3').CreateBubblePopup({\r\n position : 'top',\r\n align\t : 'center',\r\n innerHtml: 'Download 2011 District Council Electoral Boundaries Index Map',\r\n innerHtmlStyle: {\r\n color:'#FFFFFF', \r\n 'text-align':'center',\r\n 'padding':'5px'\r\n },\r\n themeName: \t'all-black',\r\n\t\t\t\tthemePath: \t'images/jquerybubblepopup-theme'\r\n });\r\n });\r\n $(document).ready(function(){\r\n\t var options = {\r\n\t\t\t\tpreselectFirst : \"hki\",\r\n\t\t\t\tpreselectSecond : \"a01\",\r\n\t\t\t\temptyOption: false,\r\n\t\t\t\temptyValue: 'Please Select',\r\n\t\t\t\temptyKey: '-'\r\n }"
> js3[[2]]
[[1]]
[1] "{ //key : value\r\n\t\t\t\t\t\"(A01) Chung Wan\": \"a01\",\r\n\t\t\t\t\t\"(A02) Mid Levels East\": \"a02\",\r\n\t\t\t\t\t\"(A03) Castle Road\": \"a03\",\r\n\t\t\t\t\t\"(A04) Peak\": \"a04\",\r\n\t\t\t\t\t\"(A05) University\": \"a05\",\r\n\t\t\t\t\t\"(A06) Kennedy Town & Mount Davis\": \"a06\",\r\n\t\t\t\t\t\"(A07) Kwun Lung\": \"a07\",\r\n\t\t\t\t\t\"(A08) Sai Wan\": \"a08\",\r\n\t\t\t\t\t\"(A09) Belcher\": \"a09\",\r\n\t\t\t\t\t\"(A10) Shek Tong Tsui\": \"a10\",\r\n\t\t\t\t\t\"(A11) Sai Ying Pun\": \"a11\",\r\n\t\t\t\t\t\"(A12) Sheung Wan\": \"a12\",\r\n\t\t\t\t\t\"(A13) Tung Wah\": \"a13\",\r\n\t\t\t\t\t\"(A14) Centre Street\": \"a14\",\r\n\t\t\t\t\t\"(A15) Water Street\": \"a15\",\r\n\t\t\t\t\t\"(B01) Hennessy\": \"b01\",\r\n\t\t\t\t\t\"(B02) Oi Kwan\": \"b02\",\r\n\t\t\t\t\t\"(B03) Canal Road\": \"b03\",\r\n\t\t\t\t\t\"(B04) Causeway Bay\": \"b04\",\r\n\t\t\t\t\t\"(B05) Tai Hang\": \"b05\",\r\n\t\t\t\t\t\"(B06) Jardine's Lookout\": \"b06\",\r\n\t\t\t\t\t\"(B07) Broadwood\": \"b07\",\r\n\t\t\t\t\t\"(B08) Happy Valley\": \"b08\",\r\n\t\t\t\t\t\"(B09) Stubbs Road\": \"b09\",\r\n\t\t\t\t\t\"(B10) Southorn\": \"b10\",\r\n\t\t\t\t\t\"(B11) Tai Fat Hau\": \"b11\",\r\n\t\t\t\t\t\"(C01) Tai Koo Shing West\": \"c01\",\r\n\t\t\t\t\t\"(C02) Tai Koo Shing East\": \"c02\",\r\n\t\t\t\t\t\"(C03) Lei King Wan\": \"c03\",\r\n\t\t\t\t\t\"(C04) Aldrich Bay\": \"c04\",\r\n\t\t\t\t\t\"(C05) Shaukeiwan\": \"c05\",\r\n\t\t\t\t\t\"(C06) A Kung Ngam\": \"c06\",\r\n\t\t\t\t\t\"(C07) Heng Fa Chuen\": \"c07\",\r\n\t\t\t\t\t\"(C08) Tsui Wan\": \"c08\",\r\n\t\t\t\t\t\"(C09) Yan Lam\": \"c09\",\r\n\t\t\t\t\t\"(C10) Siu Sai Wan\": \"c10\",\r\n\t\t\t\t\t\"(C11) King Yee\": \"c11\",\r\n\t\t\t\t\t\"(C12) Wan Tsui\": \"c12\",\r\n\t\t\t\t\t\"(C13) Fei Tsui\": \"c13\",\r\n\t\t\t\t\t\"(C14) Mount Parker\": \"c14\",\r\n\t\t\t\t\t\"(C15) Braemar Hill\": \"c15\",\r\n\t\t\t\t\t\"(C16) Tin Hau\": \"c16\",\r\n\t\t\t\t\t\"(C17) Fortress Hill\": \"c17\",\r\n\t\t\t\t\t\"(C18) Victoria Park\": 
\"c18\",\r\n\t\t\t\t\t\"(C19) City Garden\": \"c19\",\r\n\t\t\t\t\t\"(C20) Provident\": \"c20\",\r\n\t\t\t\t\t\"(C21) Fort Street\": \"c21\",\r\n\t\t\t\t\t\"(C22) Kam Ping\": \"c22\",\r\n\t\t\t\t\t\"(C23) Tanner\": \"c23\",\r\n\t\t\t\t\t\"(C24) Healthy Village\": \"c24\",\r\n\t\t\t\t\t\"(C25) Quarry Bay\": \"c25\",\r\n\t\t\t\t\t\"(C26) Nam Fung\": \"c26\",\r\n\t\t\t\t\t\"(C27) Kornhill\": \"c27\",\r\n\t\t\t\t\t\"(C28) Kornhill Garden\": \"c28\",\r\n\t\t\t\t\t\"(C29) Hing Tung\": \"c29\",\r\n\t\t\t\t\t\"(C30) Sai Wan Ho\": \"c30\",\r\n\t\t\t\t\t\"(C31) Lower Yiu Tung\": \"c31\",\r\n\t\t\t\t\t\"(C32) Upper Yiu Tung\": \"c32\",\r\n\t\t\t\t\t\"(C33) Hing Man\": \"c33\",\r\n\t\t\t\t\t\"(C34) Lok Hong\": \"c34\",\r\n\t\t\t\t\t\"(C35) Tsui Tak\": \"c35\",\r\n\t\t\t\t\t\"(C36) Yue Wan\": \"c36\",\r\n\t\t\t\t\t\"(C37) Kai Hiu\": \"c37\",\r\n\t\t\t\t\t\"(D01) Aberdeen\": \"d01\",\r\n\t\t\t\t\t\"(D02) Ap Lei Chau Estate\": \"d02\",\r\n\t\t\t\t\t\"(D03) Ap Lei Chau North\": \"d03\",\r\n\t\t\t\t\t\"(D04) Lei Tung I\": \"d04\",\r\n\t\t\t\t\t\"(D05) Lei Tung II\": \"d05\",\r\n\t\t\t\t\t\"(D06) South Horizons East\": \"d06\",\r\n\t\t\t\t\t\"(D07) South Horizons West\": \"d07\",\r\n\t\t\t\t\t\"(D08) Wah Kwai\": \"d08\",\r\n\t\t\t\t\t\"(D09) Wah Fu I\": \"d09\",\r\n\t\t\t\t\t\"(D10) Wah Fu II\": \"d10\",\r\n\t\t\t\t\t\"(D11) Pokfulam\": \"d11\",\r\n\t\t\t\t\t\"(D12) Chi Fu\": \"d12\",\r\n\t\t\t\t\t\"(D13) Tin Wan\": \"d13\",\r\n\t\t\t\t\t\"(D14) Shek Yue\": \"d14\",\r\n\t\t\t\t\t\"(D15) Wong Chuk Hang\": \"d15\",\r\n\t\t\t\t\t\"(D16) Bays Area\": \"d16\",\r\n\t\t\t\t\t\"(D17) Stanley & Shek O\": \"d17\"\r\n }\r\n }"
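Then you need to "clean up" these chunks by trimming the unneeded material from their leading and trailing parts and, as it turns out, at least drop the first item, which does not look like the table you want.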
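One possible way to do that clean-up, sketched here under assumptions rather than taken from the answer, is to skip the trimming and pull the quoted "label": "code" pairs straight out of a chunk. The string block is a shortened, hypothetical copy of the second chunk above, and pair_pattern and lookup are illustrative names.

```r
# Shortened, hypothetical copy of one extracted chunk.
block <- '{ //key : value\r\n\t"(A01) Chung Wan": "a01",\r\n\t"(A02) Mid Levels East": "a02"\r\n }'

# A quoted label, a colon with optional whitespace, then a quoted code.
pair_pattern <- '"([^"]+)"[[:space:]]*:[[:space:]]*"([^"]+)"'

# Find every pair, then split each one into its two captured groups.
pairs  <- regmatches(block, gregexpr(pair_pattern, block))[[1]]
labels <- sub(pair_pattern, "\\1", pairs)
codes  <- sub(pair_pattern, "\\2", pairs)

# Named character vector: names are the area labels, values the codes.
lookup <- setNames(codes, labels)
print(lookup)
```

Because only quoted pairs are matched, the leading comment and the surrounding braces are ignored. Applying the same steps to every chunk except the first would give one lookup per chunk.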