Question

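I have tried out your textcat package and its functions, which generally give satisfactory results, but there are some anomalies that I would like to resolve.

For example, the string "a good thing", regardless of letter casing, returns "scots" rather than "english".

The same thing happens whether I pass the full sentence or just the short string: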

textcat("The human species learned long, long ago that sticking together is a good thing.")
[1] "scots"
textcat("A good thing.")
[1] "scots"

I have also tried other packages, such as cld2, cld3 and franc, plus a few others.

detect_language("long ago that sticking together is a good thing")
[1] "en"

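Package cld2 gives the correct classification, "en", but I have not yet tested it more thoroughly on my training and test datasets.

Package cld3 returns the same value as cld2.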

library("cld3", lib.loc="~/R/win-library/3.3")

detect_language("long ago that sticking together is a good thing")
[1] "en"

Package franc returns "sco", in agreement with textcat.

franc("The human species learned long, long ago that sticking together is a good thing.")
[1] "sco"

Answer 1

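I got a solution from the package developer: dropping Scots is an option. Scots here refers to Lowland Scots, which is itself a Germanic language closely related to English. Hmm, I have my doubts about that (",)...

textcat::textcat("The human species learned long, long ago that sticking together is a good thing.", textcat::TC_char_profiles[-43])
[1] "english"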
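The index -43 in that call relies on the position of the Scots profile inside TC_char_profiles. A minimal sketch of the same idea that selects by name instead, assuming the profile is named "scots" (the label textcat returns in the question); the variable names profiles_no_scots and x are only illustrative:

```r
library(textcat)

# Keep every character profile except the one named "scots".
# Logical indexing avoids depending on the profile's position in the database.
keep <- names(TC_char_profiles) != "scots"
profiles_no_scots <- TC_char_profiles[keep]

x <- "The human species learned long, long ago that sticking together is a good thing."

# Classify against the reduced profile database.
textcat(x, p = profiles_no_scots)
```

Selecting by name should keep working if the order of the bundled profiles ever changes, whereas a hard-coded position might silently drop a different language.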

textcat misclassification: English reported as Scots
