Regex: extracting multiple numbers from a single line of text

Date: 2019-06-03 04:49:00

Tags: r regex

Question

I have downloaded a series of tables from this website:

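library(rvest)  # read_html(), html_table(), html_nodes(), html_text(); also re-exports %>%

url <- "https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/"
df <- url %>%
  read_html() %>%
  html_table() %>%
  # name each table after its <caption> text
  setNames(., url %>%
             read_html() %>%
             html_nodes("caption") %>%
             html_text())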

I need to extract the numbers from the Tax on this income variable contained in the tables:

$`Resident tax rates for 2016-17`
      Taxable income                         Tax on this income
1        0 – $18,200                                        Nil
2  $18,201 – $37,000               19c for each $1 over $18,200
3  $37,001 – $87,000 $3,572 plus 32.5c for each $1 over $37,000
4 $87,001 – $180,000  $19,822 plus 37c for each $1 over $87,000
5  $180,001 and over $54,232 plus 45c for each $1 over $180,000

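Ideally, I would like to add three columns with the following data to each table:

New column 1: NA, 3572, 19822, 54232

New column 2: 19, 32.5, 37, 45

New column 3: 18200, 37000, 87000, 180000

Most of the tables follow the format of the one above, but some have more rows, and some use "cents" instead of "c", i.e. row 2, column 2 would read:

19 cents for each $1 over $18,200

So the regex pattern needs to match both 19c and 19 cents.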

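My (poor) attempt

str_extract_all(df$`Resident tax rates for 2016-17`[2], pattern = "(?<=\\$)\\d*,\\d{3}")

This pattern only matches the dollar amounts, and it returns a character vector (neither of which is ideal).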
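To make the limitation concrete, here is a small base R sketch of what that pattern picks up on one cell. The sample string is copied from row 3 of the table above; `gregexpr()` with `perl = TRUE` stands in for `str_extract_all()` so the snippet needs no packages, which is an assumption of this sketch rather than part of the original attempt.

```r
# One cell from the "Tax on this income" column (row 3 above)
x <- "$3,572 plus 32.5c for each $1 over $37,000"

# Same pattern as the attempt: digits, a comma, three digits, preceded by "$"
m <- regmatches(x, gregexpr("(?<=\\$)\\d*,\\d{3}", x, perl = TRUE))[[1]]
m
# "3,572"  "37,000"   (the 32.5c rate is not matched at all)

# The matches are still text; stripping the commas allows a numeric conversion
amounts <- as.numeric(gsub(",", "", m))
amounts
# 3572 37000
```

The rate is missed because it has no leading "$" and no thousands separator, so it needs its own expression.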

3 answers:

Answer 0: (score: 2)

This uses three different expressions, one for each of the three new columns.

library(dplyr)
library(stringr)

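df[[1]] %>%
   # strip thousands separators so every amount is a plain run of digits
   mutate(`Tax on this income` = gsub(",", "", `Tax on this income`), 
          # col1: base amount directly after a leading "$"
          col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"), 
          # col2: rate followed by "c"; the decimal point is escaped so it cannot match any other character
          col2 = str_extract(`Tax on this income`, "\\d+(\\.\\d+)?(?=\\s*c)"),
          # col3: threshold after the last "$", at the end of the string
          col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))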

#      Taxable income                       Tax on this income  col1 col2   col3
#1        0 – $18,200                                      Nil  <NA> <NA>   <NA>
#2  $18,201 – $37,000              19c for each $1 over $18200  <NA>   19  18200
#3  $37,001 – $87,000 $3572 plus 32.5c for each $1 over $37000  3572 32.5  37000
#4 $87,001 – $180,000  $19822 plus 37c for each $1 over $87000 19822   37  87000
#5  $180,001 and over $54232 plus 45c for each $1 over $180000 54232   45 180000

Since "cents" also starts with "c", this also works when "cents" is used instead of "c".

df[[19]] %>%
  mutate(`Tax on this income` = gsub(",", "", `Tax on this income`), 
          col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"), 
          # escaped decimal point and \\s* keep the space before "cents" out of the match
          col2 = str_extract(`Tax on this income`, "\\d+(\\.\\d+)?(?=\\s*c)"),
          col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))

#     Taxable income                           Tax on this income  col1 col2  col3
#1       $1 – $5,400                                          Nil  <NA> <NA>  <NA>
#2  $5,401 – $20,700              20 cents for each $1 over $5400  <NA>   20  5400
#3 $20,701 – $38,000  $3060 plus 34 cents for each $1 over $20700  3060   34 20700
#4 $38,001 – $50,000  $8942 plus 43 cents for each $1 over $38000  8942   43 38000
#5  $50,001 and over $14102 plus 47 cents for each $1 over $50000 14102   47 50000
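As a quick check that one lookahead can cover both spellings of the unit, here is a minimal base R sketch. The three sample strings are hypothetical cells in the two formats described in the question, and `rate_pattern` is a stricter variant written for this illustration (it requires "c" or "cents" to end at a word boundary), not the pattern from the answer.

```r
# Hypothetical cells: "c" form, "cents" form, and a decimal rate
x <- c("19c for each $1 over $18,200",
       "19 cents for each $1 over $18,200",
       "$3,572 plus 32.5c for each $1 over $37,000")

# A number with an optional decimal part, followed (but not consumed) by
# optional whitespace and "c" or "cents" ending at a word boundary
rate_pattern <- "\\d+(?:\\.\\d+)?(?=\\s*c(?:ents)?\\b)"

# regexpr() returns the first match per string; every sample string has one
rates <- as.numeric(regmatches(x, regexpr(rate_pattern, x, perl = TRUE)))
rates
# 19.0 19.0 32.5
```

Because the unit sits inside the lookahead, the extracted text is just the number, so `as.numeric()` works without further cleanup.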

Since you have a list of data frames, you can use map to apply this to each of them:

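# `. %>% mutate(...)` builds a function that map() calls on each data frame in the list
purrr::map(df, . %>%
             mutate(`Tax on this income` = gsub(",", "", `Tax on this income`), 
             col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+"), 
             # escaped decimal point so it cannot match an arbitrary character
             col2 = str_extract(`Tax on this income`, "\\d+(\\.\\d+)?(?=\\s*c)"),
             col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$")))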
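The extracted columns above are character vectors, whereas the question asks for numbers. One way to close that gap is sketched below in base R only, so it runs without dplyr or stringr. The helper name `add_tax_cols`, its `col` argument, and the small sample table are all hypothetical; the patterns also allow a decimal part in the base amount, which is an assumption based on amounts such as $4,471.50 appearing in the older tables.

```r
# Hypothetical helper: adds numeric col1/col2/col3 to one table
add_tax_cols <- function(tbl, col = "Tax on this income") {
  txt <- trimws(gsub(",", "", tbl[[col]]))   # drop thousands separators

  # First match of `pattern` in each cell, as numeric; NA where nothing matches
  first_match <- function(pattern) {
    m <- regexpr(pattern, txt, perl = TRUE)
    hit <- !is.na(m) & m > 0
    out <- rep(NA_character_, length(txt))
    out[hit] <- regmatches(txt, m)
    as.numeric(out)
  }

  tbl$col1 <- first_match("(?<=^\\$)\\d+(?:\\.\\d+)?")   # base amount
  tbl$col2 <- first_match("\\d+(?:\\.\\d+)?(?=\\s*c)")   # rate in cents
  tbl$col3 <- first_match("(?<=\\$)\\d+$")               # threshold
  tbl
}

# Small hand-built table in the same shape as the scraped ones
tbl <- data.frame(
  `Taxable income` = c("0 - $18,200", "$18,201 - $37,000",
                       "$37,001 - $87,000", "$19,500 - $35,787"),
  `Tax on this income` = c("Nil",
                           "19c for each $1 over $18,200",
                           "$3,572 plus 32.5c for each $1 over $37,000",
                           "$4,471.50 plus 46 cents for each $1 over $19,500"),
  check.names = FALSE
)
res <- add_tax_cols(tbl)
res[, c("col1", "col2", "col3")]
```

Assuming every table in the list has a Tax on this income column, `lapply(df, add_tax_cols)` would apply the helper to all of them.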

Answer 1: (score: 1)

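pattern = "(?:\\$(\\S+)\\s*plus\\s*)?(\\d++[.]?\\d*)\\s*c.*\\$(\\d++,.*)|.*Nil.*"

clean = function(x){
  # rewrite each cell of column 2 as "col1:col2:col3", then drop the thousands separators
  nw = gsub(',', '', trimws(gsub(pattern, '\\1:\\2:\\3', x[[2]], perl = TRUE)))
  # parse the ":"-separated text into three columns and bind them to the table
  cbind(x, read.table(text = nw, fill = TRUE, sep = ':', col.names = paste0('col', 1:3)))
}

lapply(df, clean)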

`Resident tax rates for 1983-84`
     Taxable income                                Tax on this income     col1 col2  col3
1       $1 – $4,594                                               Nil       NA   NA    NA
2  $4,595 – $19,499                  30 cents for each $1 over $4,595       NA   30  4595
3 $19,500 – $35,787  $4,471.50 plus 46 cents for each $1 over $19,500  4471.50   46 19500
4  $35,788 and over $11,963.98 plus 60 cents for each $1 over $35,788 11963.98   60 35788
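To see what the single pattern does before `read.table()` takes over, the sketch below runs the two `gsub()` steps on three cells copied from the 1983-84 output above. The variable names `step1`, `step2` and `parsed` are introduced here for illustration only.

```r
pattern <- "(?:\\$(\\S+)\\s*plus\\s*)?(\\d++[.]?\\d*)\\s*c.*\\$(\\d++,.*)|.*Nil.*"

x <- c("Nil",
       "30 cents for each $1 over $4,595",
       "$4,471.50 plus 46 cents for each $1 over $19,500")

# Step 1: each cell becomes "base:rate:threshold"; groups that did not
# take part in the match are replaced by empty strings
step1 <- gsub(pattern, "\\1:\\2:\\3", x, perl = TRUE)

# Step 2: remove the thousands separators
step2 <- gsub(",", "", trimws(step1))
step2
# "::"  ":30:4595"  "4471.50:46:19500"

# Step 3: read the ":"-separated text; empty fields become NA
parsed <- read.table(text = step2, fill = TRUE, sep = ":",
                     col.names = paste0("col", 1:3))
parsed
```

Note that the third group requires a comma in the threshold (`\\d++,`), so this approach relies on every threshold being at least $1,000.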

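Answer 2: (score: 0)

Designing a single expression for this is fairly complicated. It may be easier to design two expressions, one per column, and handle the rest of the problem in a script.

For example, for the Taxable income column, we could start with an expression similar to: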

(\d+)(\s+)?(\$?([0-9,]+)[\s–]+\$?([0-9,]+|and over)?)

And for the other column, Tax on this income:

\s+Nil|\$?([0-9,]+)?\s+?(plus\s+)?([0-9,.]+)c?\s+for each\s+(\$1 over)\s+\$?([0-9,]+)
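The expressions above are written in plain regex notation; in an R string every backslash has to be doubled. The sketch below shows how a capture-group expression of this kind might be applied with `stringr::str_match()`. The pattern `adapted` is a variant written for this illustration, not the answer's expression: it assumes the input is the cell text itself (no leading row number or padding) and it also accepts "cents". The sample cells are copied from the tables shown earlier.

```r
library(stringr)

x <- c("Nil",
       "19c for each $1 over $18,200",
       "$3,572 plus 32.5c for each $1 over $37,000",
       "$4,471.50 plus 46 cents for each $1 over $19,500")

# Illustrative adaptation: optional "$base plus", then the rate with "c" or
# "cents", then the fixed wording and the threshold
adapted <- paste0("^(?:\\$([0-9,.]+)\\s+plus\\s+)?",
                  "([0-9.]+)\\s*c(?:ents)?\\s+for each\\s+",
                  "\\$1 over\\s+\\$([0-9,]+)$")

# str_match() returns one row per input: the full match, then one column per
# group, with NA for rows that do not match and for groups that are unset
m <- str_match(x, adapted)

to_num <- function(v) as.numeric(gsub(",", "", v))
out <- data.frame(col1 = to_num(m[, 2]),
                  col2 = to_num(m[, 3]),
                  col3 = to_num(m[, 4]))
out
```

Rows such as "Nil" simply do not match and come back as NA in all three columns, so no separate alternative for them is needed in this variant.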
