I extracted some tables from a PDF in R with the help of tabulizer. Below is the code for one of the tables:
library(tabulizer)
# Build the URL with paste0() so the string contains no embedded line breaks
location <- paste0(
  "http://napic.jpph.gov.my/portal/web/guest/main-page?",
  "p_p_id=ViewPublishings_WAR_ViewPublishingsportlet&",
  "p_p_lifecycle=2&",
  "p_p_state=normal&",
  "p_p_mode=view&",
  "p_p_resource_id=fileDownload&",
  "p_p_cacheability=cacheLevelPage&",
  "p_p_col_id=column-2&",
  "p_p_col_pos=1&",
  "p_p_col_count=2&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_publishingId=433&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_action=renderReportPeriodScreen&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_language=&",
  "_ViewPublishings_WAR_ViewPublishingsportlet_pageno=1&",
  "publishingId=4537"
)
# Extract the table(s) found on page 3 of the PDF
out <- extract_tables(location, pages = 3)
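The extracted output has some quirks: the table is split into two pieces (the header and the body), and some of the data is not separated correctly. For example, the review period, the state name, and the first value all end up in the same cell.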
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] " Review " "States " "Single " "2 - 3 " "Single " "2 - 3 " "Detach " "Town " "Cluster " "Low " "Low " "Flat " "Condo- " "Total"
[2,] "Period " "" "Storey " "Storey " "Storey " "Storey " "" "House " "" "Cost " "Cost " "" "minium/" ""
[3,] "" "" "Terrace " "Terrace " "Semi- " "Semi- " "" "" "" "House " "Flat " "" "Apart-" ""
[4,] "" "" "" "" "Detach " "Detach " "" "" "" "" "" "" "ment" ""
[[2]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,] "EXISTING STOCK " "" "" "" "" "" "" "" "" "" "" "" ""
[2,] "" "" "" "" "" "" "" "" "" "" "" "" ""
[3,] "Q3 2016P WP Kuala Lumpur 21,574 " "" "66,286 " "466 " "5,968 " "7,098 " "4,671 " "4,248 " "3,786 " "95,647 " "50,156 " "163,119 " "423,019"
[4,] "WP Putrajaya 0 " "" "2,102 " "0 " "991 " "203 " "96 " "0 " "0 " "2,538 " "0 " "1,785 " "7,715"
[5,] "WP Labuan 835 " "" "1,044 " "70 " "944 " "5,686 " "11 " "0 " "966 " "680 " "1,300 " "225 " "11,761"
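The output I am looking for should be close to the original table.
I am stuck at the moment. If someone could point me in the right direction, I would be very grateful. Thanks in advance.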
Answer 0 (score: 0)
Try:
locate_areas(file, pages = NULL, resolution = 60L, widget = c("shiny",
"native", "reduced"), copy = FALSE)
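to find the area you want to extract.
After that you will need to post-process the data to get the result you want. This is currently the only way to do it with tabulizer. Regards.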
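For the post-processing step, here is a minimal sketch in base R. It assumes the two matrices look exactly like the output pasted in the question, so the rows are copied in by hand rather than read from the PDF. The names header, raw, to_num and tidy are made up for illustration and are not part of tabulizer. It collapses the four header rows into one name per column, then pulls the review period, the state, and the first value out of the merged first cell.

```r
# Header rows copied from out[[1]] in the question
header <- rbind(
  c(" Review ", "States ", "Single ", "2 - 3 ", "Single ", "2 - 3 ", "Detach ", "Town ", "Cluster ", "Low ", "Low ", "Flat ", "Condo- ", "Total"),
  c("Period ", "", "Storey ", "Storey ", "Storey ", "Storey ", "", "House ", "", "Cost ", "Cost ", "", "minium/", ""),
  c("", "", "Terrace ", "Terrace ", "Semi- ", "Semi- ", "", "", "", "House ", "Flat ", "", "Apart-", ""),
  c("", "", "", "", "Detach ", "Detach ", "", "", "", "", "", "", "ment", "")
)

# Data rows copied from out[[2]] in the question (rows 3 to 5)
raw <- rbind(
  c("Q3 2016P WP Kuala Lumpur 21,574 ", "", "66,286 ", "466 ", "5,968 ", "7,098 ", "4,671 ", "4,248 ", "3,786 ", "95,647 ", "50,156 ", "163,119 ", "423,019"),
  c("WP Putrajaya 0 ", "", "2,102 ", "0 ", "991 ", "203 ", "96 ", "0 ", "0 ", "2,538 ", "0 ", "1,785 ", "7,715"),
  c("WP Labuan 835 ", "", "1,044 ", "70 ", "944 ", "5,686 ", "11 ", "0 ", "966 ", "680 ", "1,300 ", "225 ", "11,761")
)

# Collapse the stacked header cells into one name per column
col_names <- apply(header, 2, function(x) {
  x <- trimws(x)
  paste(x[nzchar(x)], collapse = " ")
})

# Column 1 holds "<period> <state> <first value>"; split off the trailing number
first <- trimws(raw[, 1])
first_value <- sub("^.* ([0-9,]+)$", "\\1", first)
label <- sub(" [0-9,]+$", "", first)

# The period (e.g. "Q3 2016P") appears only on the first row of each block
has_period <- grepl("^Q[1-4] [0-9]{4}P? ", label)
period <- ifelse(has_period, sub("^(Q[1-4] [0-9]{4}P?) .*$", "\\1", label), NA_character_)
state <- sub("^Q[1-4] [0-9]{4}P? ", "", label)

# Carry the last seen period down to the rows that lack one
for (i in seq_along(period)) {
  if (is.na(period[i]) && i > 1) period[i] <- period[i - 1]
}

# Strip thousands separators and convert to numeric
to_num <- function(x) as.numeric(gsub(",", "", trimws(x)))
values <- cbind(to_num(first_value), apply(raw[, 3:13, drop = FALSE], 2, to_num))

tidy <- data.frame(period, state, values, check.names = FALSE, stringsAsFactors = FALSE)
names(tidy) <- col_names
```

With the real output you would start from out[[1]] and out[[2]] instead of the hand-copied matrices, and drop the "EXISTING STOCK" and blank rows first. The regular expressions assume the period always looks like "Q3 2016P" and that the first cell always ends in a number, so check them against the other pages before relying on them.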