Question

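I want my grep function to extract percentages written in different formats from multiple files. For example, they can be written in the following ways: (5%, 2.46%, 12.9%, 5 %, 2.46 %, 5 12.9 %, 5 percent, 2.46 percent, 5 per cent, etc.). I want to make sure there is at least one space before and after each match, to avoid extracting HTML code or similar content such as: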

<TD width="97%"></TD>

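Here is the code I am using, which is obviously wrong. I was wondering whether there is a way to use placeholders, like the asterisks below, that would match all kinds of numbers: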

  txt<-tryCatch(readLines(DS2[i,temp]), error = function(e) readLines(DS2[i,temp] ))
  t<-grep("**.**%", txt)

Answer 1

Rather than writing a single regular expression, it may be easier to do this in several steps. Using the examples you provided:

x <- c('5%', '2.46%', '12.9%', '5 %', '2.46 %', '5 12.9 %', 
       '5 percent', '2.46 percent', '5 per cent', 
        'etc..', '<TD width="97%"></TD>')

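get_pct <- function(x) {
    # Drop percentages inside HTML attribute values, e.g. ="97%"
    x <- gsub('="[^"]+%"', '', x)
    # Normalize " %", "percent" and "per cent" to a bare "%"
    x <- gsub('\\s*per\\s*cent|\\s*%', '%', x)
    # Flag elements that contain a number directly followed by "%"
    is_pct <- grepl('\\d+(\\.\\d+)?%', x)
    # Keep only the number in front of the "%"; everything else becomes NA
    as.numeric(ifelse(is_pct, gsub('.*?(\\d+\\.?\\d*)%.*', '\\1', x), NA))
}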

get_pct(x)
[1]  5.00  2.46 12.90  5.00  2.46 12.90  5.00  2.46  5.00    NA    NA
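
To apply this to a file, as in the question, the function can be run on the character vector that readLines() returns. The following is only a minimal sketch: get_pct() is repeated (with a single `\\1` backreference and a `%` required in the check) so that the block runs on its own, and the temporary file and its contents are made up for illustration.

```r
# get_pct() repeated here so the example is self-contained
get_pct <- function(x) {
    x <- gsub('="[^"]+%"', '', x)
    x <- gsub('\\s*per\\s*cent|\\s*%', '%', x)
    is_pct <- grepl('\\d+(\\.\\d+)?%', x)
    as.numeric(ifelse(is_pct, gsub('.*?(\\d+\\.?\\d*)%.*', '\\1', x), NA))
}

# Hypothetical input file, written to a temporary location for illustration
path <- tempfile(fileext = ".txt")
writeLines(c('Growth was 2.46 percent',
             '<TD width="97%"></TD>',
             'no numbers here'), path)

txt <- readLines(path)
pct <- get_pct(txt)   # one value (or NA) per line of the file
```

Lines without a percentage, including the HTML line, come back as NA, so `pct[!is.na(pct)]` keeps only the extracted values.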

Here is what happens step by step:

# Eliminate percentages from html tags
x <- gsub('="[^"]+%"', '', x)
x
[1] "5%"              "2.46%"           "12.9%"           "5 %"             "2.46 %"          "5 12.9 %"       
[7] "5 percent"       "2.46 percent"    "5 per cent"      "etc.."           "<TD width></TD>"

# Standardize % symbol
x <- gsub('\\s*per\\s*cent|\\s*%', '%', x)
x
[1] "5%"              "2.46%"           "12.9%"           "5%"              "2.46%"           "5 12.9%"        
[7] "5%"              "2.46%"           "5%"              "etc.."           "<TD width></TD>"

# Find percentages
is_pct <- grepl('\\d+(\\.\\d+)?%', x)

# Extract values
x <- ifelse(is_pct, gsub('.*?(\\d+\\.?\\d*)%.*', '\\1', x), NA)
as.numeric(x)
[1]  5.00  2.46 12.90  5.00  2.46 12.90  5.00  2.46  5.00    NA    NA
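
The steps above return at most one value per element. If a line can hold several percentages, and the whitespace requirement from the question should be enforced directly, a single pattern with regmatches() and gregexpr() is one possible alternative. This is a sketch under assumptions, not part of the approach above: get_all_pct is a hypothetical helper name, and the pattern uses PCRE lookarounds (perl = TRUE) to require whitespace or a string boundary on both sides of each match.

```r
# Hypothetical helper: returns every percentage found in each element of x
# as a list of numeric vectors (empty vector when nothing matches).
get_all_pct <- function(x) {
    # A number followed by %, "percent" or "per cent", delimited on both
    # sides by whitespace or the start/end of the string
    pattern <- '(?<!\\S)\\d+(?:\\.\\d+)?\\s*(?:%|per\\s*cent)(?!\\S)'
    hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
    # Strip the trailing unit and convert what is left to numbers
    lapply(hits, function(h) as.numeric(sub('\\s*(%|per\\s*cent)$', '', h)))
}

res <- get_all_pct(c('up 5% then 12.9 percent', '<TD width="97%"></TD>'))
```

Here the first element of `res` holds 5 and 12.9, and the second is empty because `97%` sits directly between quotes. The trade-off is that a percentage followed by punctuation, such as `5%,`, is also skipped, since the pattern insists on whitespace after the match.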

How do I get percentages in any format from files in R?
