Question

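What is the correct regular expression to extract the string "(procedure)", or more generally the text inside the last pair of parentheses, from strings like the ones below?

Example input string

Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)

Another example

Urinary tract infection prophylaxis (procedure)

Possible approaches are:

Go to the end of the text, look backwards for the first opening parenthesis, and take the substring from that position to the end of the text
From the beginning of the text, identify the last '(' character and take the substring from that position to the end

Other strings could be (extracting a different "tag")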

[1] "Xanthoma of eyelid (disorder)"                    "Ventricular tachyarrhythmia (disorder)"          
[3] "Abnormal urine odor (finding)"                    "Coloboma of iris (disorder)"                     
[5] "Macroencephaly (disorder)"                        "Right main coronary artery thrombosis (disorder)"

(Looking for a general regex) (or, even better, a solution in R)

Answer 1

If it is the last part of the string, this regex will do it:

/\(([^()]*)\)$/

Explanation: look for an opening ( and match everything inside it that is not a ( or a ), followed by a ) at the end of the string.

https://regex101.com/r/cEsQtf/1
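
Since the question mentions R, here is a minimal sketch of how this pattern might be applied there. The helper name last_paren is made up for illustration and is not part of the answer; regexec and regmatches are base R functions, and the backslashes are doubled because the pattern is written as an R string.

```r
# Hypothetical helper (not from the answer): returns the text inside a
# trailing, non-nested pair of parentheses, or NA when there is none.
last_paren <- function(x) {
  m <- regmatches(x, regexec("\\(([^()]*)\\)$", x))
  # Each element of m is c(full match, group 1), or character(0) if no match.
  vapply(m, function(hit) if (length(hit) == 2) hit[2] else NA_character_,
         character(1))
}

x <- c("Urinary tract infection prophylaxis (procedure)",
       "Abnormal urine odor (finding)",
       "No parentheses here")
last_paren(x)
```

Returning NA for strings without a trailing parenthesized part is a design choice of this sketch; it makes non-matching inputs easy to spot.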

Answer 2

sub with the right regular expression can do this

Text = c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
    "Urinary tract infection prophylaxis (procedure)", 
    "Xanthoma of eyelid (disorder)",                    
    "Ventricular tachyarrhythmia (disorder)",          
    "Abnormal urine odor (finding)",                    
    "Coloboma of iris (disorder)",                   
    "Macroencephaly (disorder)",                        
    "Right main coronary artery thrombosis (disorder)")
sub(".*\\((.*)\\).*", "\\1", Text)
[1] "procedure" "procedure" "disorder"  "disorder"  "finding"   "disorder" 
[7] "disorder"  "disorder"

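Addendum: detailed explanation of the regular expression
The question asks to find the contents of the final set of parentheses in the string. This expression is a bit confusing because it includes two different uses of parentheses: one to represent the parentheses in the string being processed, the other to set up a "capture group", which is how we specify which part should be returned by the expression. The expression is made up of five basic units: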

1. Initial .*   - matches everything up to the final open parenthesis. 
   Note that this is relying on "greedy matching"
2. \\(   ...    \\)   - matches the final set of parentheses. 
   Because ( by itself means something else,  we need to "escape" the 
   parentheses by preceding them with \.  That is we want the regular
   expression to say   \(  ...  \).  However, the way R interprets strings,
   if we just typed \( and \),  R would treat the \ as the start of a string
   escape and reject \( as an unrecognized escape.  So we escape the backslash.  
   R will interpret   \\(  ... \\)      as \( ... \) meaning the literal
   characters ( & ). 
3. ( ... )       Inside the pair in part 2
   This is making use of the special meaning of parentheses.  When we
   enclose an expression in parentheses, whatever value is inside them 
   will be stored in a variable for later use. That variable is called 
   \1,  which is what was used in the substitution pattern. Again, if 
   we just wrote \1, R would interpret it as if we were trying to escape
   the 1. Writing \\1 is interpreted as the character \ followed by 1, 
   i.e. \1.
4. Central .*    Inside the pair in part 3
   This is what we are looking for,  all characters inside the parentheses.
5. Final   .*
   This is in the expression to match any characters that may follow the 
   final set of parentheses.

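The sub function uses this to replace the matched pattern (in this case, all the characters in the string) with the replacement pattern \1, i.e. the contents of the variable holding the first (in our case, the only) capture group: the text inside the final parentheses.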
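
To make the role of greedy matching concrete, here is a small sketch. The first call is the one from this answer; the second, with a lazy .*? and perl = TRUE, is added only for contrast and is not part of the original answer. The variable names and the extra input without parentheses are illustrative assumptions.

```r
x <- c("Positron emission tomography using flutemetamol (18F) with computed tomography of brain (procedure)",
       "No parentheses here")

# The answer's call: the greedy leading .* pushes the match to the final pair.
# Where the pattern does not match, sub() returns the input unchanged.
greedy <- sub(".*\\((.*)\\).*", "\\1", x)
print(greedy)

# For contrast only: a lazy leading .*? (PCRE semantics, hence perl = TRUE)
# stops at the first opening parenthesis instead of the last one.
lazy <- sub(".*?\\(([^()]*)\\).*", "\\1", x[1], perl = TRUE)
print(lazy)
```

With the greedy version the first element reduces to the final tag, while the lazy variant picks up the text inside the first pair of parentheses.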

Answer 3

You can actually extract the text inside nested parentheses at the end of a string using the following:

x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
"FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))")
sub(".*(\\(((?:[^()]++|(?1))*)\\))$", "\\2", x, perl=TRUE)

See the online R demo and the regex demo.

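Details:

.* - zero or more characters other than line break characters, as many as possible
(\(((?:[^()]++|(?1))*)\)) - capturing group 1 (needed for the recursion to work):
- \( - a ( character
- ((?:[^()]++|(?1))*) - capturing group 2 (our value): zero or more repetitions of either one or more characters other than ( and ), or the whole group 1 pattern (recursed)
- \) - a ) character
$ - end of string.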

So, when there is a match, the whole string is replaced with the value of group 2. If there is no match, the string is left as it is.
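
Because a non-matching string is returned unchanged, it can be hard to tell a real result from an untouched input. The sketch below reuses the pattern from this answer and, as an assumption made only for illustration, marks non-matching inputs as NA; the names pat and res and the third input string are not from the original answer.

```r
x <- c("FELON IN POSSESSION OF AMMUNITION (ACTUAL POSSESSION) (79023)",
       "FAIL TO DISPLAY REGISTRATION - POSSESSION REQUIRED (320.0605(1))",
       "NO TRAILING PARENTHESES")

# Same recursive pattern as above; it needs the PCRE engine (perl = TRUE).
pat <- ".*(\\(((?:[^()]++|(?1))*)\\))$"

res <- sub(pat, "\\2", x, perl = TRUE)

# Illustrative choice: treat inputs that do not match as missing values.
res[!grepl(pat, x, perl = TRUE)] <- NA_character_
print(res)
```

The first two elements yield the contents of the last balanced group, including the nested (1) in the second string, and the third becomes NA.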

Regex: how to extract the text from the last parenthesis
