I have downloaded a series of tables from this website:
library(rvest)
url <- "https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/"
# parse the page once and reuse it for both the tables and their captions
page <- read_html(url)
df <- page %>%
  html_table() %>%
  setNames(page %>% html_nodes("caption") %>% html_text())
I need to extract the numbers from the "Tax on this income" variable contained in the tables:
$`Resident tax rates for 2016-17`
Taxable income Tax on this income
1 0 – $18,200 Nil
2 $18,201 – $37,000 19c for each $1 over $18,200
3 $37,001 – $87,000 $3,572 plus 32.5c for each $1 over $37,000
4 $87,001 – $180,000 $19,822 plus 37c for each $1 over $87,000
5 $180,001 and over $54,232 plus 45c for each $1 over $180,000
Ideally, I would like to add three columns to each table with the following data:
New column 1: NA, 3572, 19822, 54232
New column 2: 19, 32.5, 37, 45
New column 3: 18200, 37000, 87000, 180000
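Most tables follow the format of the table above, but some have more rows and some use "cents", i.e. row 2, column 2 would read:
19 cents for each $1 over $18,200
So the regex pattern needs to match both 19c and 19 cents.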
str_extract_all(df$`Resident tax rates for 2016-17`[2], pattern = "(?<=\\$)\\d*,\\d{3}")
This pattern only matches the dollar amounts, and it returns a character vector (neither is ideal).
Answer 0 (score: 2)
This uses three different expressions, one for each of the three columns:
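library(dplyr)
library(stringr)
df[[1]] %>%
  # drop thousands separators so each amount is a single run of digits
  mutate(`Tax on this income` = gsub(",", "", `Tax on this income`),
         # col1: base amount right after a leading "$" (optional decimals)
         col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+(\\.\\d+)?"),
         # col2: rate immediately before "c" or "cents"
         col2 = str_extract(`Tax on this income`, "\\d+(\\.\\d+)?(?=\\s*c)"),
         # col3: threshold after the last "$", at the end of the string
         col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))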
# Taxable income Tax on this income col1 col2 col3
#1 0 – $18,200 Nil <NA> <NA> <NA>
#2 $18,201 – $37,000 19c for each $1 over $18200 <NA> 19 18200
#3 $37,001 – $87,000 $3572 plus 32.5c for each $1 over $37000 3572 32.5 37000
#4 $87,001 – $180,000 $19822 plus 37c for each $1 over $87000 19822 37 87000
#5 $180,001 and over $54232 plus 45c for each $1 over $180000 54232 45 180000
Because "cents" also starts with "c", this also works for tables that use "cents" instead of "c".
df[[19]] %>%
  mutate(`Tax on this income` = gsub(",", "", `Tax on this income`),
         col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+(\\.\\d+)?"),
         col2 = str_extract(`Tax on this income`, "\\d+(\\.\\d+)?(?=\\s*c)"),
         col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$"))
# Taxable income Tax on this income col1 col2 col3
#1 $1 – $5,400 Nil <NA> <NA> <NA>
#2 $5,401 – $20,700 20 cents for each $1 over $5400 <NA> 20 5400
#3 $20,701 – $38,000 $3060 plus 34 cents for each $1 over $20700 3060 34 20700
#4 $38,001 – $50,000 $8942 plus 43 cents for each $1 over $38000 8942 43 38000
#5 $50,001 and over $14102 plus 47 cents for each $1 over $50000 14102 47 50000
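To check the rate pattern in isolation, the sketch below runs it on a few hand-written strings. The inputs are modelled on the tables above, with commas already removed; the last non-Nil one, with a single-digit rate, is made up for illustration. It uses a form of the pattern with the decimal point escaped, "\\d+(\\.\\d+)?(?=\\s*c)", so that a single-digit rate such as 5c is matched as well.

```r
library(stringr)

# Sample strings: "c" form, "cents" form, a decimal rate,
# a made-up single-digit rate, and a "Nil" row.
tax <- c("19c for each $1 over $18200",
         "$3060 plus 34 cents for each $1 over $20700",
         "$3572 plus 32.5c for each $1 over $37000",
         "5c for each $1 over $100",
         "Nil")

# Digits with an optional decimal part, followed (after optional
# whitespace) by a "c"; the lookahead keeps the "c" out of the match.
rates <- str_extract(tax, "\\d+(\\.\\d+)?(?=\\s*c)")
print(rates)
```

Rows without a rate, such as "Nil", come back as NA, which matches the behaviour shown in the output above.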
Since you have a list of data frames, you can use map to apply this to each data frame:
purrr::map(df, . %>%
  mutate(`Tax on this income` = gsub(",", "", `Tax on this income`),
         col1 = str_extract(`Tax on this income`, "(?<=^\\$)\\d+(\\.\\d+)?"),
         col2 = str_extract(`Tax on this income`, "\\d+(\\.\\d+)?(?=\\s*c)"),
         col3 = str_extract(`Tax on this income`, "(?<=\\$)\\d+$")))
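The question asks for numbers rather than strings, so here is a self-contained sketch of the same idea with a numeric conversion added. The helper name add_tax_cols and the two small hand-built tables are assumptions made for illustration (the rows are copied from the output shown in this thread); on the real data you would pass the scraped list df instead of tables.

```r
library(dplyr)
library(stringr)
library(purrr)

# Hypothetical helper (not from the original answer): adds the three
# extracted columns to one table and converts them to numeric.
add_tax_cols <- function(tbl) {
  tbl %>%
    # work on a comma-free copy so the original text column is preserved
    mutate(tax  = gsub(",", "", `Tax on this income`),
           col1 = as.numeric(str_extract(tax, "(?<=^\\$)\\d+(\\.\\d+)?")),
           col2 = as.numeric(str_extract(tax, "\\d+(\\.\\d+)?(?=\\s*c)")),
           col3 = as.numeric(str_extract(tax, "(?<=\\$)\\d+$"))) %>%
    select(-tax)
}

# Toy stand-in for the scraped list of tables.
tables <- list(
  `Resident tax rates for 2016-17` = tibble(
    `Taxable income` = c("0 - $18,200", "$18,201 - $37,000", "$37,001 - $87,000"),
    `Tax on this income` = c("Nil",
                             "19c for each $1 over $18,200",
                             "$3,572 plus 32.5c for each $1 over $37,000")),
  `Resident tax rates for 1983-84` = tibble(
    `Taxable income` = c("$1 - $4,594", "$4,595 - $19,499", "$19,500 - $35,787"),
    `Tax on this income` = c("Nil",
                             "30 cents for each $1 over $4,595",
                             "$4,471.50 plus 46 cents for each $1 over $19,500"))
)

# Apply the helper to every table; the list names are kept.
result <- map(tables, add_tax_cols)
print(result)
```

Wrapping the mutate call in a named function keeps the map call short and makes the extraction easy to test on a single table first.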
Answer 1 (score: 1)
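pattern <- "(?:\\$(\\S+)\\s*plus\\s*)?(\\d++[.]?\\d*)\\s*c.*\\$(\\d++,.*)|.*Nil.*"
clean <- function(x) {
  # rewrite each entry as "base:rate:threshold" ("Nil" becomes "::"), then drop commas
  nw <- gsub(",", "", trimws(gsub(pattern, "\\1:\\2:\\3", x[[2]], perl = TRUE)))
  # parse the colon-separated strings into three columns and bind them to the table
  cbind(x, read.table(text = nw, fill = TRUE, sep = ":", col.names = paste0("col", 1:3)))
}
lapply(df, clean)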
`Resident tax rates for 1983-84`
Taxable income Tax on this income col1 col2 col3
1 $1 – $4,594 Nil NA NA NA
2 $4,595 – $19,499 30 cents for each $1 over $4,595 NA 30 4595
3 $19,500 – $35,787 $4,471.50 plus 46 cents for each $1 over $19,500 4471.50 46 19500
4 $35,788 and over $11,963.98 plus 60 cents for each $1 over $35,788 11963.98 60 35788
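To make the intermediate step of this approach visible, the sketch below applies the same pattern to three strings taken from the 1983-84 table above and prints the colon-separated form before read.table parses it. The variable names joined and parsed are introduced here for illustration only.

```r
pattern <- "(?:\\$(\\S+)\\s*plus\\s*)?(\\d++[.]?\\d*)\\s*c.*\\$(\\d++,.*)|.*Nil.*"

x <- c("Nil",
       "30 cents for each $1 over $4,595",
       "$4,471.50 plus 46 cents for each $1 over $19,500")

# Replace each entry with "base:rate:threshold"; groups that did not
# take part in the match are left empty. Commas are removed afterwards.
joined <- gsub(",", "", trimws(gsub(pattern, "\\1:\\2:\\3", x, perl = TRUE)))
print(joined)
# [1] "::"               ":30:4595"         "4471.50:46:19500"

# Empty fields are read as NA, which gives the three numeric columns.
parsed <- read.table(text = joined, fill = TRUE, sep = ":",
                     col.names = paste0("col", 1:3))
print(parsed)
```

Note that the last group of the pattern expects a comma in the threshold, so a threshold below $1,000 would presumably not match; whether any of the scraped tables contain one is not something this thread shows.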
Answer 2 (score: 0)