Question

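In my news articles, I would like to convert several different ngrams that refer to the same political party into a single acronym. I want to do this so that a sentiment dictionary does not confuse the words in a party's name ("Liberal Party") with the same word used in a different context ("liberal helping").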

I can do this with str_replace_all as shown below, and I know about the tokens_compound() function in quanteda, but it doesn't seem to do exactly what I need.

library(stringr)
text<-c('a text about some political parties called the new democratic party the new democrats and the liberal party and the liberals')
text1<-str_replace_all(text, '(liberal party)|liberals', 'olp')
text2<-str_replace_all(text1, '(new democrats)|new democratic party', 'ndp')

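Should I preprocess the text in some way like this before turning it into a corpus? Or is there a way to do this in quanteda after it has been turned into a corpus?

Here is some expanded sample code that illustrates the problem a little better: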

library(quanteda)
text<-c('a text about some political parties called the new democratic party
the new democrats and the liberal party and the liberals. I would like the
word democratic to be counted in the dfm but not the words new democratic.
The same goes for liberal helpings but not liberal party')
partydict <- dictionary(list(
olp = c("liberal party", "liberals"),
ndp = c("new democrats", "new democratic party"),
sentiment=c('liberal', 'democratic')
))

dfm(text, dictionary=partydict)

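This example counts "democratic" both where it appears on its own and where it appears inside "new democratic", but I would like those two uses counted separately.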

Answer 1

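You want the function tokens_lookup(), after defining a dictionary that has the canonical party labels as keys and lists all the ngram variations of the party names as values. With exclusive = FALSE it keeps the tokens that are not matched, so in effect it substitutes the canonical party name for every variation.

In the example below, I've modified your input text a bit to show how the party names get combined while phrases that use "liberal" but not "liberal party" are left alone.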

library("quanteda")

text<-c('a text about some political parties called the new democratic party 
         which is conservative the new democrats and the liberal party and the 
         liberals which are liberal helping poor people')
toks <- tokens(text)

partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))

(toks2 <- tokens_lookup(toks, partydict, exclusive = FALSE))
## tokens from 1 document.
## text1 :
##  [1] "a"            "text"         "about"        "some"         "political"    "parties"     
##  [7] "called"       "the"          "NDP"          "which"        "is"           "conservative"
## [13] "the"          "NDP"          "and"          "the"          "OLP"          "and"         
## [19] "OLP"          "which"        "are"          "liberal"      "helping"      "poor"        
## [25] "people"

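So that has replaced the party name variations with the party keys. A dfm built from these new tokens preserves the uses of (e.g.) "liberal" that might be linked to sentiment, because "liberal party" has already been combined and replaced with "OLP". Applying a dictionary to the dfm will now work for your example of "liberal" in "liberal helping" without confusing it with the use of "liberal" in the party name.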

sentdict <- dictionary(list(
    left = c("liberal", "left"),
    right = c("conservative", "")
))

dfm(toks2) %>%
    dfm_lookup(dictionary = sentdict, exclusive = FALSE)
## Document-feature matrix of: 1 document, 19 features (0% sparse).
## 1 x 19 sparse Matrix of class "dfm"
##        features
## docs    olp ndp a text about some political parties called the which is RIGHT and LEFT are helping
##  text1   2   2 1    1     1    1         1       1      1   3     2  1     1   2    1   1       1
##        features
## docs    poor people
##  text1    1      1

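Two additional notes:

If you don't want the keys uppercased in the replacement tokens, set capkeys = FALSE.
You can set different matching types using the valuetype argument, including valuetype = "regex". (Note also that the parentheses in the regular expressions in your example do nothing: alternation has the lowest precedence, so '(new democrats)|new democratic party' already matches either whole phrase. With tokens_lookup() you don't need to build such patterns at all.)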
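
As a minimal sketch of the capkeys note, the snippet below runs the same kind of lookup with capkeys = FALSE, so the replacement tokens keep the lowercase spelling used for the dictionary keys. The input sentence and the object name toks_lower are made up for illustration; the dictionary is the one defined earlier in this answer.

```r
library("quanteda")

# Hypothetical one-sentence input, chosen so that both party keys are matched
toks <- tokens("the new democratic party and the new democrats oppose the liberal party")

partydict <- dictionary(list(
    olp = c("liberal party", "the liberals"),
    ndp = c("new democrats", "new democratic party")
))

# exclusive = FALSE keeps the unmatched tokens;
# capkeys = FALSE inserts the keys as written ("ndp", "olp") instead of uppercasing them
toks_lower <- tokens_lookup(toks, partydict, exclusive = FALSE, capkeys = FALSE)

# Show the tokens of the single document as a plain character vector
print(as.list(toks_lower)[[1]])
```

Lowercase keys only matter at the tokens stage, since dfm() lowercases features by default, as the olp and ndp columns in the output above show.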
