Question

I've run into a problem with the language detection function textcat().

library(textcat)

textcat('ogłoszenie')
# [1] "polish"

textcat('OGŁOSZENIE')
# [1] "slovenian-iso8859_2"

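"Ogłoszenie" is a perfectly valid word, but when it is written in capital letters it is detected as Slovenian. Does anyone know how to avoid this?

For now I'm applying tolower() to the text.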

Answer 1

By default it uses the textcat::TC_char_profiles profiles, in which the tolower option is set to FALSE. We can create a new profile object and change that option to TRUE, see below:

library(textcat)

# create a new profile with tolower option TRUE
myProfile <- textcat::TC_char_profiles
attributes(myProfile)$options$tolower <- TRUE

textcat('OGŁOSZENIE', p = myProfile)
# [1] "polish"
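
To confirm that the change took effect, the option can be read back from the profile's attributes. This is a minimal sketch, assuming the settings live in an "options" attribute as the assignment above implies; lowerProfile is only an illustrative name.

```r
library(textcat)

# Copy the default profiles; the packaged object itself is left untouched
lowerProfile <- textcat::TC_char_profiles

# Read the current setting (the answer above states it is FALSE by default)
attr(lowerProfile, "options")$tolower

# Switch lower-casing on for the copy only
attr(lowerProfile, "options")$tolower <- TRUE

# Read it back to verify the change
attr(lowerProfile, "options")$tolower
```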

Or we can do the same as the workaround you suggested:

textcat(tolower('OGŁOSZENIE'))
# [1] "polish"
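
If this comes up repeatedly, the workaround could be wrapped in a small helper. This is a sketch only: textcat_ci is a hypothetical name, not part of the textcat package, and it assumes tolower() handles the characters correctly in your locale.

```r
library(textcat)

# Hypothetical helper: lower-case the input before detecting the language.
# Extra arguments (for example p = some profile) are passed on to textcat().
textcat_ci <- function(x, ...) {
  textcat(tolower(x), ...)
}

# Both spellings go through the same lower-casing step
textcat_ci(c('ogłoszenie', 'OGŁOSZENIE'))
```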

Is it a bug that the textcat function is case-sensitive?
