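R function for pattern matching

Asked: 2016-05-18 14:35:23

Tags: regex r function pattern-matching

I'm working on a text-mining project that will analyze some speeches by the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two-column data frame with the results. I added a variable called pair. Here is a sample of the Clinton data frame: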

           V1   V2  pair
1          c(  NN  FALSE
2      "thank VBP  FALSE
3         you PRP  FALSE
4          so  RB  FALSE
5        much  RB  FALSE
6           .   .  FALSE
7          it PRP  FALSE
8          is VBZ  FALSE
9   wonderful  JJ  FALSE
10         to  TO  FALSE
11         be  VB  FALSE
12       here  RB  FALSE
13        and  CC  FALSE
14        see  VB  FALSE
15         so  RB  FALSE
16       many  JJ  FALSE
17    friends NNS  FALSE
18          .   .  FALSE
19        ive  JJ  FALSE
20     spoken VBN  FALSE 

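What I'd like to do now is write a function that iterates over the V2 POS column and evaluates it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very experienced at writing functions, so I'm sure I've done it wrong, but this is what I've got so far.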

pairs <- function(x){

  JJ <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  for(i in 1:(length)(x) {
      if(x == J && x+1 == N) {    #i.e., if the first word = J and the next = N
        pair[i] <- "JJ|NN"     #insert this into the 'pair' variable
      } else if (x == R && x+1 == J && x+2 != N) {
        pair[i] <- "RB|JJ"
      } else if  (x == J && x+1 == J && x+2 != N) {
        pair[i] <- "JJ|JJ"
      } else if (x == N && x+1 == J && x+2 != N) {
        pair[i] <- "NN|JJ"
      } else if (x == R && x+1 == V) {
        pair[i] <- "RB|VB"
         } else {
         pair[i] <- "FALSE"
         }
  }
}

# Run the function
cl.df.pairs <- pairs(cl.df$V2)

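There are a number of (truly embarrassing) problems. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because each one matches an opening "{". I assume it's because R is expecting something else there.

Also, and more importantly, this function won't get me exactly what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.

Then I need to figure out how to evaluate the semantic orientation of each word combination by comparing the phrases to the pos/neg lexicon datasets that I have, but that's a separate issue. I have the formula from the article, which I hope will point me in the right direction.

I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I have looked at other SO questions/answers, like this one and this one, but they didn't work for me when I tried to adapt them. I'm fairly certain I'm missing something obvious here, so I would appreciate any advice.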

EDIT:

Here are the first 20 lines of the Sanders data frame.

head(sa.POS.df, 20)
           V1   V2
1         the   DT
2    american   JJ
3      people  NNS
4         are  VBP
5    catching  VBG
6          on   RB
7           .    .
8        they  PRP
9  understand  VBP
10       that   IN
11  something   NN
12         is  VBZ
13 profoundly   RB
14      wrong   JJ
15       when  WRB
16          ,    ,
17         in   IN
18        our PRP$
19    country   NN
20      today   NN

I have written the following function:

pairs <- function(x, y) {
  require(gsubfn)
  J <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  for(i in 1:(length(x))) {
    ngram <- c(x[[i]], x[[i+1]]) 
# the ngram consists of the word on line `i` and the word below line `i`
  }
  strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)

  ngrams.df = data.frame(ngrams=ngram)
  return(ngrams.df)
}

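So, what's supposed to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram. All of the resulting ngrams should then populate ngrams.df.
So I entered the following function call and got an error: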

> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds  

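I'm just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull out the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.

2 Answers:

Answer 0 (score: 1)

I think the following is the code you wrote, but without throwing errors: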

pairs <- function(x) {

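  J <- "^JJ"       #adjectives
  N <- "^N[A-Z]"   #any noun form
  R <- "^R[A-Z]"   #any adverb form ("^" keeps it from also matching PRP or WRB)
  V <- "^V[A-Z]"   #any verb form

  # J, N, R and V are regular expressions, so test them with grepl();
  # == would compare a tag to the literal string, e.g. "N[A-Z]"
  is.pos = function(pattern, tag) grepl(pattern, as.character(tag))

  pair = rep("FALSE", nrow(x))   # one entry per row; length(x) would count columns
  for(i in 1:(nrow(x)-2)) {
    this.pos = x[i,2]
    next.pos = x[i+1,2]
    next.next.pos = x[i+2,2]
    if(is.pos(J, this.pos) && is.pos(N, next.pos)) {    #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN"     #insert this into the 'pair' variable
    } else if (is.pos(R, this.pos) && is.pos(J, next.pos) && !is.pos(N, next.next.pos)) {
      pair[i] <- "RB|JJ"
    } else if  (is.pos(J, this.pos) && is.pos(J, next.pos) && !is.pos(N, next.next.pos)) {
      pair[i] <- "JJ|JJ"
    } else if (is.pos(N, this.pos) && is.pos(J, next.pos) && !is.pos(N, next.next.pos)) {
      pair[i] <- "NN|JJ"
    } else if (is.pos(R, this.pos) && is.pos(V, next.pos)) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }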
  }

  ## then deal with the last two elements, for which you can't check what's up next

  return(pair)
}
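
As for the two Error: unexpected '}' messages in the question, a likely cause is the for line: in for(i in 1:(length)(x) { the parenthesis opened by for( is never closed, so R rejects that line, and the last two closing braces are then left with nothing to close. Here is a minimal sketch that checks both spellings with parse(); the helper name parses_ok is made up for illustration:

```r
# Hypothetical helper: TRUE if the text is syntactically valid R, FALSE otherwise.
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

broken <- "for(i in 1:(length)(x) { }"    # loop header as posted in the question
fixed  <- "for(i in 1:(length(x))) { }"   # parentheses balanced

parses_ok(broken)
# [1] FALSE
parses_ok(fixed)
# [1] TRUE
```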

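Not sure what you mean by this, though:

  Also, and more importantly, this function won't get me exactly what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.

Answer 1 (score: 1)

OK, here we go. Using this data (nicely shared with dput()):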

df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L, 
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",", 
".", "american", "are", "catching", "country", "in", "is", "on", 
"our", "people", "profoundly", "something", "that", "the", "they", 
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L, 
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L, 
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN", 
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1", 
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20"))

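I'll use the stringr package because its syntax is consistent, so I don't have to look up the argument order for grep. We first detect the adjectives, then the nouns, and then find where they line up (offset by 1). Then we paste together the words that correspond to the matches.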

library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")

pairs = which(c(FALSE, adj) & c(noun, FALSE))

ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
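
If stringr is not available, the same offset idea can be sketched in base R with grepl(). This is only an illustration on a small made-up vector (words and tags are assumed names, not objects from the question):

```r
# Toy data; the tags follow the same POS scheme as the data frame above.
words <- c("the", "american", "people", "are", "catching", "on")
tags  <- c("DT", "JJ", "NNS", "VBP", "VBG", "RB")

adj  <- grepl("JJ", tags)   # TRUE where the tag is an adjective
noun <- grepl("NN", tags)   # TRUE where the tag is a noun

# Padding adj on the left shifts it one place, so position k holds adj[k - 1].
# A hit at k therefore means "adjective at k - 1, noun at k".
hits <- which(c(FALSE, adj) & c(noun, FALSE))

ngram_base <- paste(words[hits - 1], words[hits])
ngram_base
# [1] "american people"
```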

Now we can put it in a function. For flexibility, I made the patterns arguments (with adjective and noun as the defaults).

bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
    pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
                      c(str_detect(type, patt2), FALSE))
    return(paste(word[pairs - 1], word[pairs]))
}

Demonstrating use on the original data:

with(df, bigram(word = V1, type = V2))
# [1] "american people"

Let's make up some data with more than one match to make sure it works:

df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad",  "bank"),
                 t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
#          w   t
# 1 american  JJ
# 2   people NNS
# 3     hate VBP
# 4        a  DT
# 5      big  JJ
# 6      bad  JJ
# 7     bank  NN

with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"

And going back to the original data to test a different pattern:

with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are"   "something is"
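
To return the matched pattern along with each word pair, which is what the question asks for, the same shifting trick can be extended to the five tag patterns listed there, including the "third tag is not a noun" condition. The following is a rough base-R sketch, not a tested implementation of Turney's method: label_bigrams and fill are hypothetical names, and the patterns are anchored with "^" on the assumption that "R[A-Z]" should not also match tags such as PRP.

```r
# Hypothetical helper: label every bigram whose tags match one of the five
# patterns from the question; returns the word pair and the pattern label.
label_bigrams <- function(word, type) {
  word <- as.character(word)   # factors become plain strings
  type <- as.character(type)
  n <- length(type)
  if (n < 2) {
    return(data.frame(pair = character(0), pattern = character(0),
                      stringsAsFactors = FALSE))
  }

  first  <- type[-n]               # tag at position i
  second <- type[-1]               # tag at position i + 1
  third  <- c(type[-(1:2)], "")    # tag at position i + 2 ("" past the end)

  J <- "^JJ"; N <- "^N[A-Z]"; R <- "^R[A-Z]"; V <- "^V[A-Z]"
  not_noun <- !grepl(N, third)

  pattern <- rep(NA_character_, n - 1)
  # Only fill slots that are still NA, mirroring the if / else-if order.
  fill <- function(pattern, cond, label) {
    pattern[is.na(pattern) & cond] <- label
    pattern
  }
  pattern <- fill(pattern, grepl(J, first) & grepl(N, second), "JJ|NN")
  pattern <- fill(pattern, grepl(R, first) & grepl(J, second) & not_noun, "RB|JJ")
  pattern <- fill(pattern, grepl(J, first) & grepl(J, second) & not_noun, "JJ|JJ")
  pattern <- fill(pattern, grepl(N, first) & grepl(J, second) & not_noun, "NN|JJ")
  pattern <- fill(pattern, grepl(R, first) & grepl(V, second), "RB|VB")

  keep <- !is.na(pattern)
  data.frame(pair = paste(word[-n], word[-1])[keep],
             pattern = pattern[keep],
             stringsAsFactors = FALSE)
}

res <- label_bigrams(word = c("american", "people", "hate", "a", "big", "bad", "bank"),
                     type = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
res
```

On this toy input the sketch should return two rows, "american people" and "bad bank", both labelled JJ|NN; "big bad" is skipped because the tag after it is a noun.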