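I'm working on a text-mining project that analyzes some speeches by the three remaining presidential candidates. I have done the POS tagging with OpenNLP and created a two-column data frame with the results. I also added a variable named pair. Here is a sample of the Clinton data frame: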
V1 V2 pair
1 c( NN FALSE
2 "thank VBP FALSE
3 you PRP FALSE
4 so RB FALSE
5 much RB FALSE
6 . . FALSE
7 it PRP FALSE
8 is VBZ FALSE
9 wonderful JJ FALSE
10 to TO FALSE
11 be VB FALSE
12 here RB FALSE
13 and CC FALSE
14 see VB FALSE
15 so RB FALSE
16 many JJ FALSE
17 friends NNS FALSE
18 . . FALSE
19 ive JJ FALSE
20 spoken VBN FALSE
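What I want to do now is write a function that iterates through the V2 POS column and checks it for specific pattern pairs. (These come from Turney's PMI article.) I don't know much about writing functions yet, so I'm sure I've done it wrong, but this is what I have so far.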
pairs <- function(x){
JJ <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length)(x) {
if(x == J && x+1 == N) { #i.e., if the first word = J and the next = N
pair[i] <- "JJ|NN" #insert this into the 'pair' variable
} else if (x == R && x+1 == J && x+2 != N) {
pair[i] <- "RB|JJ"
} else if (x == J && x+1 == J && x+2 != N) {
pair[i] <- "JJ|JJ"
} else if (x == N && x+1 == J && x+2 != N) {
pair[i] <- "NN|JJ"
} else if (x == R && x+1 == V) {
pair[i] <- "RB|VB"
} else {
pair[i] <- "FALSE"
}
}
}
# Run the function
cl.df.pairs <- pairs(cl.df$V2)
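There are many (truly embarrassing) problems. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because each one matches an opening "{". I assume it's because R is expecting something else there.
Also, and more importantly, this function won't get me exactly what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.
Then I need to figure out how to evaluate the semantic orientation of each word combination by comparing the phrases to the pos/neg lexicon data set that I have, but that's another question. I have the formula from the article, and I'm hoping it will point me in the right direction.
I've searched all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I've looked at other SO questions/answers, such as this one and this one, but they didn't work for me when I tried to adapt them. I'm fairly sure I'm missing something obvious here, so any suggestions would be appreciated.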
EDIT:
Here are the first 20 rows of the Sanders data frame.
head(sa.POS.df, 20)
V1 V2
1 the DT
2 american JJ
3 people NNS
4 are VBP
5 catching VBG
6 on RB
7 . .
8 they PRP
9 understand VBP
10 that IN
11 something NN
12 is VBZ
13 profoundly RB
14 wrong JJ
15 when WRB
16 , ,
17 in IN
18 our PRP$
19 country NN
20 today NN
I have written the following function:
pairs <- function(x, y) {
require(gsubfn)
J <- "JJ" #adjectives
N <- "N[A-Z]" #any noun form
R <- "R[A-Z]" #any adverb form
V <- "V[A-Z]" #any verb form
for(i in 1:(length(x))) {
ngram <- c(x[[i]], x[[i+1]])
# the ngram consists of the word on line `i` and the word below line `i`
}
strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)
ngrams.df = data.frame(ngrams=ngram)
return(ngrams.df)
}
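So, what is supposed to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram. All of the resulting ngrams should then populate ngrams.df.
So I entered the following function call and got an error: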
> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds
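I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to extract the actual adjectives and nouns. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.
Answer 0 (score: 1)
I think the following is the code you meant to write, but without throwing errors: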
pairs <- function(x) {
  # x is the data frame: words in column 1, POS tags in column 2
  J <- "^JJ"      # adjectives
  N <- "^N[A-Z]"  # any noun form
  R <- "^R[A-Z]"  # any adverb form
  V <- "^V[A-Z]"  # any verb form
  pair <- rep("FALSE", nrow(x))  # one entry per row of x
  for (i in seq_len(max(nrow(x) - 2, 0))) {
    this.pos <- x[i, 2]
    next.pos <- x[i + 1, 2]
    next.next.pos <- x[i + 2, 2]
    # the tag patterns are regular expressions, so test them with grepl(), not ==
    if (grepl(J, this.pos) && grepl(N, next.pos)) {  # adjective followed by noun
      pair[i] <- "JJ|NN"  # record the matched pattern
    } else if (grepl(R, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "RB|JJ"
    } else if (grepl(J, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "JJ|JJ"
    } else if (grepl(N, this.pos) && grepl(J, next.pos) && !grepl(N, next.next.pos)) {
      pair[i] <- "NN|JJ"
    } else if (grepl(R, this.pos) && grepl(V, next.pos)) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }
  ## then deal with the last two elements, for which you can't check what's up next
  return(pair)
}
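As for the `unexpected '}'` errors in the question: they appear to come from the loop header `for(i in 1:(length)(x) {`, whose parentheses do not balance. A small check with `parse()` suggests R gives up at the `{`; typed at the console, the remaining lines are then read as fresh input, so the two closing braces at the end have nothing left to close. The strings `broken` and `fixed` below are only illustrative.

```r
broken <- "for(i in 1:(length)(x) { }"   # loop header as typed in the question
fixed  <- "for (i in seq_along(x)) { }"  # balanced parentheses

# parse() only checks the syntax; it does not run the loop
msg <- tryCatch({ parse(text = broken); "parsed fine" },
                error = function(e) conditionMessage(e))
msg  # the message mentions an unexpected '{'

ok <- tryCatch({ parse(text = fixed); TRUE },
               error = function(e) FALSE)
ok   # TRUE when the header is syntactically valid
```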
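I'm not sure what you mean by this, though:
"Also, and more importantly, this function won't get me exactly what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that."
Answer 1 (score: 1)
OK, here we go. Using this data (nicely shared with dput()):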
df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L,
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",",
".", "american", "are", "catching", "country", "in", "is", "on",
"our", "people", "profoundly", "something", "that", "the", "they",
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L,
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L,
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN",
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
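I'll use the stringr package because its syntax is consistent, so I don't have to look up the argument order for grep. We first detect the adjectives, then the nouns, and then find out where they line up (offset by 1). Then we paste together the words that correspond to the matches.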
library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")
pairs = which(c(FALSE, adj) & c(noun, FALSE))
ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"
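To see why padding with `FALSE` lines things up, here is a small base-R sketch on a made-up tag vector (`tags` is illustrative, not taken from the speeches, and `grepl()` stands in for `str_detect()`). Prepending `FALSE` shifts the adjective flags down by one position, so element `i` of the combined vector is `TRUE` when tag `i - 1` is an adjective and tag `i` is a noun.

```r
tags <- c("DT", "JJ", "NNS", "VBP")

adj  <- grepl("JJ", tags)  # FALSE  TRUE FALSE FALSE
noun <- grepl("NN", tags)  # FALSE FALSE  TRUE FALSE

shifted_adj <- c(FALSE, adj)   # adjective flags moved down one slot
padded_noun <- c(noun, FALSE)  # padded so both vectors have length 5

# position of the noun in each adjective-noun pair
hits <- which(shifted_adj & padded_noun)
hits
# [1] 3
```

The adjective of each pair then sits at `hits - 1`, which is why the words are pasted from `pairs - 1` and `pairs`.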
Now we can put this in a function. For flexibility, I've made the patterns arguments (with adjective, noun as the defaults).
bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
c(str_detect(type, patt2), FALSE))
return(paste(word[pairs - 1], word[pairs]))
}
Demonstrating its use on the original data:
with(df, bigram(word = V1, type = V2))
# [1] "american people"
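The vectorized form also sidesteps the `subscript out of bounds` error from the question. In that loop `i` runs up to `length(x)`, so the last iteration asks for `x[[i + 1]]`, one element past the end. A minimal sketch of pairing each word with its successor without a loop, assuming a plain character vector (`words` is made up for illustration):

```r
words <- c("the", "american", "people", "are")
n <- length(words)

# indexing one past the end with [[ ]] is what triggers the error
err <- tryCatch(words[[n + 1]], error = function(e) conditionMessage(e))
err

# pair every word with the next one: drop the last word on the left,
# drop the first word on the right
ngrams <- paste(head(words, -1), tail(words, -1))
ngrams
# [1] "the american"    "american people" "people are"
```

If a loop is preferred, iterating over `seq_len(length(x) - 1)` keeps `i + 1` within bounds.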
Let's make some data with multiple matches to make sure it works:
df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad", "bank"),
t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
# w t
# 1 american JJ
# 2 people NNS
# 3 hate VBP
# 4 a DT
# 5 big JJ
# 6 bad JJ
# 7 bank NN
with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"
Back to the original data to test a different pattern:
with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are" "something is"
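Several of the patterns listed in the question also require that the third tag is not a noun (for example adverb, adjective, then anything but a noun), and the question asks for the matched pattern alongside the words. One way this might be sketched in base R is shown below. `tagged_pairs`, its arguments, and the label format are hypothetical names for illustration, and the anchored regular expressions are an assumption about how the tags should be matched.

```r
# Hypothetical helper: returns matching two-word phrases plus a pattern label.
# not3 (optional) is a pattern the tag two positions ahead must NOT match.
tagged_pairs <- function(word, type, patt1, patt2, not3 = NULL, label = "") {
  word <- as.character(word)  # also handles factor columns
  type <- as.character(type)
  n <- length(type)
  if (n < 2) {
    return(data.frame(phrase = character(0), pattern = character(0),
                      stringsAsFactors = FALSE))
  }
  first <- seq_len(n - 1)     # candidate positions for the first word
  hit <- grepl(patt1, type[first]) & grepl(patt2, type[first + 1])
  if (!is.null(not3)) {
    # tag two positions ahead; "" past the end, so a pair that ends the
    # text counts as "not followed by" the excluded tag
    third <- c(type[-(1:2)], "")
    hit <- hit & !grepl(not3, third)
  }
  data.frame(phrase = paste(word[first][hit], word[first + 1][hit]),
             pattern = rep(label, sum(hit)),
             stringsAsFactors = FALSE)
}

# made-up example data
w    <- c("a", "very", "big", "bank", "is", "truly", "wrong", ".")
tags <- c("DT", "RB", "JJ", "NN", "VBZ", "RB", "JJ", ".")

res <- rbind(
  tagged_pairs(w, tags, "^JJ", "^N[A-Z]", label = "JJ|NN"),
  tagged_pairs(w, tags, "^R[A-Z]", "^JJ", not3 = "^N[A-Z]", label = "RB|JJ")
)
res$phrase
# [1] "big bank"    "truly wrong"
res$pattern
# [1] "JJ|NN" "RB|JJ"
```

Here "very big" is skipped by the second call because the following tag is a noun, while "truly wrong" is kept.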