Question

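I have a string containing a sentence parse, and I want to extract the substrings enclosed between an opening and a closing bracket. The catch is that brackets of the same type nested inside (parentheses, in this case) need to be grabbed as well. So basically, the number of opening brackets associated with an NP has to equal the number of closing brackets.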

In this example:

x <- "(TOP (S (NP (NNP Clifford)) (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD ate) (NP (PRP$ my) (NN lunch)))(. .)))"

say I want to extract the noun phrases (NP) into the following three substrings:

(NP (NNP Clifford))
(NP (DT the) (JJ big) (JJ red) (NN dog))
(NP (PRP$ my) (NN lunch))

This should generalize to any part of the string: if I wanted to grab the VP brackets instead, I could follow the same logic.

Answer 1

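The language of balanced parentheses is not regular, so it cannot be matched with a basic regular expression. You can do it with a recursive regular expression (see hwnd's answer), but I don't recommend it, because the syntax gets rather ugly. Instead, build a parser out of simpler regular expressions, variables, and program control flow. Something like this: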

for each character:
    if it's a (, increment the nesting depth.
    if it's a ), decrement the nesting depth.
    if the nesting depth is exactly zero, we've reached the end of this expression.
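The pseudocode above could be sketched in R roughly as follows. This is only one possible reading of it: the function name extract_balanced and its label argument are made up for illustration, and the sketch assumes each phrase starts with the literal text "(<label> ". Unlike a left-to-right regex scan, it also returns phrases nested inside another phrase with the same label.

```r
# Hypothetical helper: return every balanced "(<label> ...)" substring of x.
extract_balanced <- function(x, label) {
  chars  <- strsplit(x, "", fixed = TRUE)[[1]]
  prefix <- paste0("(", label, " ")
  # Candidate start positions: every literal occurrence of "(<label> ".
  starts <- gregexpr(prefix, x, fixed = TRUE)[[1]]
  if (starts[1] == -1) return(character(0))

  out <- character(0)
  for (s in starts) {
    depth <- 0
    for (i in seq(s, length(chars))) {
      if (chars[i] == "(") depth <- depth + 1   # opening bracket: go deeper
      if (chars[i] == ")") depth <- depth - 1   # closing bracket: come back up
      if (depth == 0) {
        # The bracket opened at position s has just been closed.
        out <- c(out, substr(x, s, i))
        break
      }
    }
  }
  out
}

x <- "(TOP (S (NP (NNP Clifford)) (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD ate) (NP (PRP$ my) (NN lunch)))(. .)))"
extract_balanced(x, "NP")
extract_balanced(x, "VP")
```

If a phrase is never closed (unbalanced input), the inner loop simply runs off the end of the string and nothing is returned for that start position.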

Alternatively, use a library such as openNLP, which can already do this parsing for you.

Answer 2

I'm not sure whether your substrings will always be well defined, but in this case you can do:

regmatches(x, 
    gregexpr('(?x)
              (?=\\(NP)           # lookahead: the match must start with (NP
                (                 # start of group 1
                \\(               # match open parenthesis
                    (?:           # start grouping construct
                        [^()]++   # one or more non-parenthesis (possessive)
                          |       # OR
                        (?1)      # found ( or ), recurse 1st subpattern
                    )*            # end grouping construct
                \\)               # match closing parenthesis
                )                 # end of group 1
             ', x, perl=TRUE))[[1]]

# [1] "(NP (NNP Clifford))"                     
# [2] "(NP (DT the) (JJ big) (JJ red) (NN dog))"
# [3] "(NP (PRP$ my) (NN lunch))"
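To reuse the same recursive pattern for other labels such as VP, it could be wrapped in a small function. The sketch below is an assumption rather than part of the original answer: the name extract_phrase is hypothetical, the pattern is the one above with the comments stripped, and a \b is added after the label so that NN does not also match NNP. The label is pasted into the regex unescaped, so it only suits plain alphanumeric tags.

```r
# Hypothetical wrapper around the recursive pattern. The label must be a plain
# alphanumeric tag; a tag such as PRP$ would need regex escaping first.
extract_phrase <- function(x, label) {
  # (?=\(LABEL\b)  lookahead: the match must start with "(LABEL"
  # ( \( (?: [^()]++ | (?1) )* \) )  group 1: one balanced bracket pair
  pattern <- sprintf("(?=\\(%s\\b)(\\((?:[^()]++|(?1))*\\))", label)
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "(TOP (S (NP (NNP Clifford)) (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD ate) (NP (PRP$ my) (NN lunch)))(. .)))"
extract_phrase(x, "VP")
extract_phrase(x, "NP")
```

Because gregexpr returns non-overlapping matches, a phrase nested inside another phrase with the same label is returned only as part of the outer one.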

Answer 3

You can use Avinash Raj's new package:

library(dangas)
extract_all_a("(NP", "))", x, delim=TRUE)
[[1]]
[1] "(NP (NNP Clifford))"                     
[2] "(NP (DT the) (JJ big) (JJ red) (NN dog))"
[3] "(NP (PRP$ my) (NN lunch))"

GitHub link here. To install it, use: devtools::install_github("Avinash-Raj/dangas/dangas")

If you have trouble downloading it, try:

library(stringr)
str_extract_all(x, "\\(NP.*?\\)\\)")

Update

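@Kevin correctly pointed out that I overlooked the requirement for balanced parentheses. But as you mentioned in the comments, you may not need that to solve your problem. Please let me know whether this helps; if not, I'll delete it.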
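To make the limitation concrete, here is a small sketch using the same lazy pattern with base R instead of stringr. The input y is made up for illustration (it is not from the question) and contains an NP nested inside another NP; the lazy match stops at the first "))" and comes back unbalanced.

```r
# Made-up input with an NP nested inside another NP (not from the question).
y <- "(S (NP (NP (NN dog)) (PP (IN of) (NN war))))"

# Same lazy pattern as above, run through base R.
lazy <- regmatches(y, gregexpr("\\(NP.*?\\)\\)", y, perl = TRUE))[[1]]
lazy

# Count opening and closing brackets in the match to check the balance.
n_open  <- lengths(regmatches(lazy, gregexpr("(", lazy, fixed = TRUE)))
n_close <- lengths(regmatches(lazy, gregexpr(")", lazy, fixed = TRUE)))
n_open == n_close
```

On the question's own example the pattern happens to work, because none of the NPs there contains a nested phrase that ends in "))" before the NP itself closes.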

Regex: grabbing opening and closing brackets with other brackets in between
