Question

I have the following string:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

I want to extract the string between the two <body> tags. The result I am looking for is:

substring = "<body>Iwant\to+extr@ctth!sstr|ng<body>"

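Note that the substring between the two <body> tags can contain letters, digits, punctuation, and special characters.

Is there a simple way to do this? Thanks!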

Answer 1

Here is the regular-expression way:

regmatches(string, regexpr('<body>.+<body>', string))
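To make the one-liner above easier to follow, here is a minimal sketch that splits it into its two steps. The variable names `m`, `result`, and `no_match` are only illustrative and are not part of the original answer.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# Step 1: regexpr() locates the first match. It returns the start position,
# with the match length stored in the "match.length" attribute.
m <- regexpr("<body>.+<body>", string)

# Step 2: regmatches() uses that position and length to pull out the text.
result <- regmatches(string, m)
print(result)

# When nothing matches, regexpr() returns -1 and regmatches() yields an
# empty character vector, so it is worth checking the length before use.
no_match <- regmatches("no tags here", regexpr("<body>.+<body>", "no tags here"))
print(length(no_match))
```

Note that `print()` shows the tab inside the string as the escape `\t`; the string itself holds a single tab character there.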

Answer 2

regex = '<body>.+?<body>'

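You want a non-greedy quantifier (.+?) so that the match stops at the next <body> tag instead of spanning as many <body> tags as possible.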
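The difference only shows up when the input has more than two tags. A small sketch, assuming a made-up input `s` with three <body> tags (this string is not from the question):

```r
# Hypothetical input with three <body> tags, used only for illustration.
s <- "x<body>first<body>second<body>y"

# Greedy: .+ runs on to the LAST <body> in the string.
greedy <- regmatches(s, regexpr("<body>.+<body>", s, perl = TRUE))

# Non-greedy: .+? stops at the NEXT <body> after the opening one.
lazy <- regmatches(s, regexpr("<body>.+?<body>", s, perl = TRUE))

print(greedy)
print(lazy)
```

With the question's own string, which has exactly two tags, both patterns return the same substring.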

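If you are using the regular expression on its own, without a helper function that returns the match, you will need a capture group to extract what you want, i.e.: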

regex = '(<body>.+?<body>)'
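In R, one way to read capture groups is to pair `regexec()` with `regmatches()`. The sketch below is an assumption about how the pattern might be used, not something the answer spells out; it moves the parentheses inside the tags so that the group holds only the inner text.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# regexec() records the position of the full match and of every capture group.
m <- regexec("<body>(.+?)<body>", string)

# regmatches() then returns a list with one character vector per input string:
# element 1 is the full match, element 2 is the first capture group.
parts <- regmatches(string, m)[[1]]

full  <- parts[1]  # substring including both tags
inner <- parts[2]  # text between the tags only

print(full)
print(inner)
```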

Answer 3

strsplit() can help you:

> string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
> x = strsplit(string, '<body>', fixed = FALSE, perl = FALSE, useBytes = FALSE)
> x
[[1]]
[1] "asflkjsdhlkjsdhglk"         "Iwant\to+extr@ctth!sstr|ng" "sdgdfsghsghsgh"

> x[[1]][2]
[1] "Iwant\to+extr@ctth!sstr|ng"

Of course, this gives you all three parts of the string, and none of them include the tags.
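If the tags are needed in the result, as in the question, they can be pasted back on. This is a minimal sketch that assumes the string contains exactly two <body> tags, so that the text of interest is the second piece; `pieces` and `tagged` are illustrative names.

```r
string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

# fixed = TRUE treats "<body>" as a literal delimiter rather than a regex.
pieces <- strsplit(string, "<body>", fixed = TRUE)[[1]]

# Assumption: exactly two tags, so pieces[2] is the text between them.
tagged <- paste0("<body>", pieces[2], "<body>")
print(tagged)
```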

Answer 4

I believe both Matthew's and Steve's answers are acceptable. Here is another solution:

string = "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"

regmatches(string, regexpr('<body>.+<body>', string))

output = sub(".*(<body>.+<body>).*", "\\1", string)

print(output)
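One thing to keep in mind with the `sub()` approach: when the pattern does not match, `sub()` returns the input unchanged rather than an empty result. A hedged sketch of a guard follows; the helper name `extract_body` is hypothetical and not part of the answer.

```r
# Hypothetical helper: returns the tagged substring, or NA when the
# input does not contain two <body> tags.
extract_body <- function(x) {
  pattern <- ".*(<body>.+<body>).*"
  if (!grepl(pattern, x)) {
    return(NA_character_)  # sub() alone would hand back x unchanged here
  }
  sub(pattern, "\\1", x)
}

string <- "asflkjsdhlkjsdhglk<body>Iwant\to+extr@ctth!sstr|ng<body>sdgdfsghsghsgh"
print(extract_body(string))
print(extract_body("no tags here"))
```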
