Question

I have R objects that contain domain names and IP addresses. For example:

11.22.44.55.test.url.com.localhost

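I am using a regex in R to capture the IP address. My problem is that when there is no match, the whole string is returned as the output. That becomes a problem because I am working with a very large data set. I currently use the regular expression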

sub("([0-9]+)\\.([0-9]+)\\.([0-9]+)\\.([0-9]+).*","\\1.\\2.\\3.\\4","11.22.44.55.test.url.com.localhost")

which gives me

11.22.44.55

But if I have the following

sub("([0-9]+)\\.([0-9]+)\\.([0-9]+)\\.([0-9]+).*","\\1.\\2.\\3.\\4","11.22.44.test.url.com.localhost")

then it gives me

11.22.44.test.url.com.localhost

which is not what I want, since that string contains no IP address. Is there a solution?

Answer 1

You can pre-process with grep to keep only the strings that are formatted the way you want, and then use gsub on those.

x <- c("11.22.44.55.test.url.com.localhost", "11.22.44.test.url.com.localhost")
gsub("((\\d+\\.){3}\\d+)(.*)", "\\1",  grep("(\\d+\\.){4}", x, value=TRUE))
#[1] "11.22.44.55"
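
One side effect of filtering with grep first is that the result no longer lines up with the input vector. If that matters, a minimal sketch along the following lines keeps one output per input, assuming NA is an acceptable placeholder for strings without an IP. The names ip_pattern and has_ip are illustrative, and anchoring the pattern to the start of the string is an assumption that the answer above does not make.

```r
x <- c("11.22.44.55.test.url.com.localhost", "11.22.44.test.url.com.localhost")

# Illustrative pattern: four dot-separated runs of digits at the start of the string.
ip_pattern <- "^((\\d+\\.){3}\\d+).*"

# TRUE where an IP-like prefix exists, FALSE otherwise.
has_ip <- grepl(ip_pattern, x)

# Keep the captured prefix for matches and NA for everything else.
ips <- ifelse(has_ip, sub(ip_pattern, "\\1", x), NA_character_)
print(ips)
```

Because the output has the same length as x, it can be assigned straight back into a data frame column.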

Answer 2

Actually, your code is working as designed. When sub() cannot find a match, it returns the original string. From the manual:

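For sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g. if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE). Such strings can be re-encoded by enc2native.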

Emphasis added.
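
To make the behaviour concrete, here is a small sketch that contrasts sub() with the base R pair regexpr() and regmatches(), which return only the text that matched. The variable names and the simplified pattern are illustrative and not taken from the question.

```r
x <- c("11.22.44.55.test.url.com.localhost", "11.22.44.test.url.com.localhost")
pattern <- "([0-9]+\\.){3}[0-9]+"

# sub() leaves the second element untouched because the pattern never matches it.
replaced <- sub(paste0("(", pattern, ").*"), "\\1", x)
print(replaced)

# regexpr() reports match positions (-1 for no match);
# regmatches() then returns only the matched text and drops non-matching elements.
m <- regexpr(pattern, x)
matched <- regmatches(x, m)
print(matched)
```

Note that regmatches() used this way returns a shorter vector than its input, so the positions in m are needed if you want to know which elements matched.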

Answer 3

You could try this pattern:

(?:\d{1,3}+\.){3}+\d{1,3}

I tested it in Java:

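import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is arbitrary; it only makes the snippet compilable.
public class IpExtractor {
    // Four groups of 1-3 digits separated by dots; possessive quantifiers prevent backtracking.
    static final Pattern p = Pattern.compile("(?:\\d{1,3}+\\.){3}+\\d{1,3}");

    public static void main(String[] args) {
        final String s1 = "11.22.44.55.test.url.com.localhost";
        final String s2 = "11.24.55.test.url.com.localhost";
        System.out.println(getIps(s1));
        System.out.println(getIps(s2));
    }

    // Collects every substring of the input that matches the IP pattern.
    public static List<String> getIps(final String string) {
        final Matcher m = p.matcher(string);
        final List<String> strings = new ArrayList<>();
        while (m.find()) {
            strings.add(m.group());
        }
        return strings;
    }
}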

Output:

[11.22.44.55]
[]
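
Since the question is about R, the same pattern can probably be carried over with base R. The sketch below assumes perl = TRUE, because the possessive quantifiers and the non-capturing group need the PCRE engine, and it uses gregexpr() so that every match in a string is returned, mirroring the Java loop. The name ip_pattern is illustrative.

```r
x <- c("11.22.44.55.test.url.com.localhost", "11.24.55.test.url.com.localhost")

# Same pattern as the Java example, with backslashes doubled for an R string.
ip_pattern <- "(?:\\d{1,3}+\\.){3}+\\d{1,3}"

# gregexpr() finds all matches per element; regmatches() extracts them as a list.
ips <- regmatches(x, gregexpr(ip_pattern, x, perl = TRUE))
print(ips)
```

The result is a list with one character vector per input string; an input without an IP yields an empty vector rather than the original string.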

Answer 4

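Take a look at the gsubfn or strapply functions in the gsubfn package. When you want to return the match rather than replace it, these functions are a better fit than sub.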
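
As a rough illustration, the sketch below uses strapplyc from the same package, a variant of strapply that simply collects the matches. It assumes gsubfn is installed and skips the extraction otherwise. The pattern is written without parentheses on purpose so that the whole match is returned instead of capture groups; the name ip_pattern is illustrative.

```r
x <- c("11.22.44.55.test.url.com.localhost", "11.22.44.test.url.com.localhost")

# No capture groups, so the full match is what gets returned.
ip_pattern <- "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+"

if (requireNamespace("gsubfn", quietly = TRUE)) {
  # strapplyc() returns a list with the matches found in each input string.
  ips <- gsubfn::strapplyc(x, ip_pattern)
  print(ips)
} else {
  message("gsubfn is not installed; run install.packages(\"gsubfn\") first")
}
```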

Capturing only the IP address using R
