Question

I have a stats file that contains lines like this: "system.l2.compressor.compression_size::1 0 # Number of blocks that compressed to fit in 1 bits"

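The 0 is the value I care about in this case. The spacing between the actual statistic and whatever comes before and after it is not the same every time.

This is my code for trying to get the statistic: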

if (grepl("system.l2.compressor.compression_size::1", line))
    {
      matches <- regmatches(line, gregexpr("[[:digit:]]+\\.*[[:digit:]]", line))
      compression_size_1 = as.numeric(unlist(matches))[1]
    }

The reason I have this regex

[[:digit:]]+\\.*[[:digit:]]

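is that in other cases the statistic is a decimal number. I don't expect decimals in cases like the example I posted, but it would be nice to have a "fail-safe" regex that also catches that case.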

In this case I get "2", "1", "0", "1" as the answer. How can I restrict it so that I only get the actual statistic?

I tried using something like

"[:space:][[:digit:]]+\\.*[[:digit:]][:space:]"

or other variants, but I either get NA or the same numbers with whitespace around them.

Answer 1

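Here are a couple of base R possibilities, depending on how the data is set up. In the future, it helps to provide a reproducible example; if these don't work, definitely provide one. If a pattern works, adapting it to stringr or stringi functions will probably be faster. Good luck!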

# The digits after the whitespace that follows the run of non-space characters after "::"
gsub(".*::\\S+\\s+(\\d+).*", "\\1", strings)
[1] "58740" "58731" "70576"

# Getting the digit(s) following a space and preceding a space and pound sign
gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"

# Combining the two (this is the most restrictive)
gsub(".*::\\S+\\s+(\\d+)\\s+#.*", "\\1", strings)
[1] "58740" "58731" "70576"

# Extracting the first digits surrounded by spaces (least restrictive)
gsub(".*?\\s+(\\d+)\\s+.*", "\\1", strings)
[1] "58740" "58731" "70576"

# Or, using stringr for the last pattern:
as.numeric(stringr::str_extract(strings, "\\s+\\d+\\s+"))
[1] 58740 58731 70576

Edit: an explanation of the second pattern:

gsub(".*\\s+(\\d+)\\s+#.*", "\\1", strings)

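.* - . = any character other than \n; * = any number of times
\\s+ - \\s = whitespace; + = at least one instance (of whitespace)
(\\d+) - () = capture group, which you can refer to later by its position (i.e. "\\1" returns the first capture group in the pattern); \\d = a digit; + = at least one instance (of a digit)
\\s+# - \\s = whitespace; + = at least one instance (of whitespace); # = a literal pound sign
.* - . = any character other than \n; * = any number of times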
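
The question also asks for a pattern that keeps working when the statistic is a decimal. As a minimal sketch, the second pattern can be adapted by allowing an optional fractional part inside the capture group. The sample lines and the name `stat_lines` are made up for illustration (the second line is a hypothetical statistic, not taken from a real stats file); only base R `sub()` and `as.numeric()` are used.

```r
# Hypothetical sample lines; the spacing varies on purpose
stat_lines <- c(
  "system.l2.compressor.compression_size::1      0    # Number of blocks that compressed to fit in 1 bits",
  "system.l2.example_ratio::total   0.25 # hypothetical decimal statistic"
)

# \\d+        = integer part
# (\\.\\d+)?  = optional fractional part
# The outer group is \\1, so only the number sitting just before "#" is kept
values <- as.numeric(sub(".*\\s+(\\d+(\\.\\d+)?)\\s+#.*", "\\1", stat_lines))
values
```

Because the number has to be preceded by whitespace and followed by whitespace and "#", the "1" in "::1" and the "1" inside the trailing comment are not picked up.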

Data:

strings <- c("system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
             "system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
             "system.l2.overall_hits::.cpu.data 70576 # number of overall hits")
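
To tie this back to the line-by-line lookup in the question, the most restrictive pattern could be wrapped in a small helper, sketched here under some assumptions. The function name `extract_stat` and its arguments are hypothetical, not part of the original answer, and the sketch assumes integer statistics and that the first matching line is the one wanted.

```r
strings <- c("system.l2.compressor.compression_size::256 58740 # Number of blocks that compressed to fit in 256 bits",
             "system.l2.compressor.encoding::Base*.8_1 58731 # Number of data entries that match encoding Base8_1",
             "system.l2.overall_hits::.cpu.data 70576 # number of overall hits")

# Hypothetical helper: return the value of one named statistic as a number
extract_stat <- function(lines, stat_name) {
  # fixed = TRUE treats stat_name as a literal string, so "." and "*" are not regex operators
  hit <- lines[grepl(stat_name, lines, fixed = TRUE)]
  if (length(hit) == 0) return(NA_real_)  # statistic not present
  # Keep only the digits between the "::name" part and the "#" comment
  as.numeric(sub(".*::\\S+\\s+(\\d+)\\s+#.*", "\\1", hit[1]))
}

extract_stat(strings, "system.l2.overall_hits::.cpu.data")
extract_stat(strings, "system.l2.compressor.compression_size::256")
```

One caveat of this sketch: a literal substring match on a name ending in "::1" would also hit lines for "::16" or "::128", so the name may need a stricter check in a real stats file.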
