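I am trying to build a quanteda dictionary that contains many overlapping terms. I believe regex lookahead/lookbehind is one way to handle this and avoid false hits, but I must be doing something wrong.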
library("quanteda")
text <- c("guinea", "equatorial guinea", "guinea bissau")
dict <- dictionary(list(guinea = "guinea"))
dfm1 <- dfm(text, dictionary = dict, valuetype = "regex")
colSums(dfm1)
dict2 <- dictionary(list(guinea="(?<!equatorial[[:space:]])guinea"))
dfm2 <- dfm(text, dictionary=dict2, valuetype="regex")
colSums(dfm2)
dict3 <- dictionary(list(guinea="guinea(?![[:space:]]bissau)"))
dfm3 <- dfm(text, dictionary=dict3, valuetype="regex")
colSums(dfm3)
The expected results are:
# dfm1
colSums(dfm1)
guinea
3
# dfm2
colSums(dfm2)
guinea
2
# dfm3
colSums(dfm3)
guinea
2
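But the actual results are all 3. Is the problem in the lookahead/lookbehind itself, or in how I am matching the whitespace?
Answer 0 (score: 1)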
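This kind of regex match does not work because a pattern cannot span multiple tokens: a dfm(x, dictionary = ...) call actually tokenizes the text first and then calls tokens_lookup(), so each pattern is tested against one token at a time and the lookaround never sees the neighbouring word.
There is an easier way to do this: just include the multi-word values in the dictionary. So: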
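The effect of tokenization on the lookbehind can be reproduced without quanteda. The following is a minimal base R sketch, not quanteda's actual implementation: it assumes a plain whitespace split as a stand-in for the tokenizer and uses grepl() with perl = TRUE as a stand-in for the per-token pattern matching. The variable names are made up for illustration.

```r
# Illustrative sketch only: base R stand-ins for tokenization and lookup.
texts <- c("guinea", "equatorial guinea", "guinea bissau")
pattern <- "(?<!equatorial[[:space:]])guinea"

# Applied to the whole strings, the lookbehind can see the preceding word,
# so "equatorial guinea" is rejected.
whole_string_hits <- grepl(pattern, texts, perl = TRUE)

# Assumption: split on whitespace to mimic tokenization.
toks <- strsplit(texts, "[[:space:]]+")

# Applied token by token, "guinea" is always a token on its own, with
# nothing before it for the lookbehind to inspect, so every document matches.
per_token_hits <- vapply(
  toks,
  function(tok) sum(grepl(pattern, tok, perl = TRUE)),
  integer(1)
)

print(whole_string_hits)
print(per_token_hits)
```

The per-token counts sum to 3, which matches what the question observed for all three dictionaries.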
library("quanteda")
## Package version: 1.4.3
text <- c("guinea", "equatorial guinea", "guinea bissau")
dict <- dictionary(list(guinea = "guinea"))
dict2 <- dictionary(list(guinea = "equatorial guinea"))
dict3 <- dictionary(list(guinea = "guinea bissau"))
dfm(text, dictionary = dict)
## Document-feature matrix of: 3 documents, 1 feature (0.0% sparse).
## 3 x 1 sparse Matrix of class "dfm"
## features
## docs guinea
## text1 1
## text2 1
## text3 1
dfm(text, dictionary = dict2)
## Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
## 3 x 1 sparse Matrix of class "dfm"
## features
## docs guinea
## text1 0
## text2 1
## text3 0
dfm(text, dictionary = dict3)
## Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
## 3 x 1 sparse Matrix of class "dfm"
## features
## docs guinea
## text1 0
## text2 0
## text3 1