Question

Does CoreNLP have an API for getting unigrams, bigrams, trigrams, etc.?

For example, I have the string "I have the best car ". I would like to get:

I
I have
the
the best
car

based on the string I pass in.

Answer 1

If you are coding in Java, take a look at the getNgrams* functions in the StringUtils class in CoreNLP.

You can also use CollectionUtils.getNgrams (which the StringUtils class uses as well).

Answer 2

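You can use CoreNLP to tokenize, but for extracting n-grams you can do it natively in whatever language you are working in. If, say, you are piping the tokens into Python, you can split them up with list slicing and a few list comprehensions: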

>>> tokens
['I', 'have', 'the', 'best', 'car']
>>> unigrams = [tokens[i:i+1] for i,w in enumerate(tokens) if i+1 <= len(tokens)]
>>> bigrams = [tokens[i:i+2] for i,w in enumerate(tokens) if i+2 <= len(tokens)]
>>> trigrams = [tokens[i:i+3] for i,w in enumerate(tokens) if i+3 <= len(tokens)]
>>> unigrams
[['I'], ['have'], ['the'], ['best'], ['car']]
>>> bigrams
[['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]
>>> trigrams
[['I', 'have', 'the'], ['have', 'the', 'best'], ['the', 'best', 'car']]
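
The three comprehensions above differ only in the window size, so they can be folded into one helper. The following is a minimal sketch of that idea; the names ngrams and ngrams_in_range are made up for illustration and are not part of CoreNLP or any other library.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when n > len(tokens), so short inputs yield [].
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def ngrams_in_range(tokens, min_n, max_n):
    """Collect n-grams for every size from min_n to max_n inclusive."""
    return [gram for n in range(min_n, max_n + 1) for gram in ngrams(tokens, n)]


tokens = ['I', 'have', 'the', 'best', 'car']

# Join each n-gram back into a string if you want "I have" rather than a list.
bigram_strings = [' '.join(gram) for gram in ngrams(tokens, 2)]
```

Iterating over range(len(tokens) - n + 1) does the same job as the "if i+n <= len(tokens)" filter, without enumerating indices that are then thrown away.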

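CoreNLP is great for NLP heavy lifting such as dependency parsing, coref, POS tagging, etc. If all you want is tokenization, though, it seems like overkill, like bringing a fire truck to a water gun fight. Something like TreeTagger might meet your tokenization needs just as well.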
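
If the input is plain English like the example string, even a regular expression may be enough to get tokens. Here is a rough sketch using only the standard library; simple_tokenize and ngram_strings are hypothetical helpers, and this pattern is an assumption that will not match the output of CoreNLP or TreeTagger on harder text (abbreviations, URLs, hyphenation and so on).

```python
import re


def simple_tokenize(text):
    # A word (optionally with internal apostrophes, e.g. "Don't"),
    # or a single non-space punctuation character.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def ngram_strings(tokens, n):
    # Slide a window of n tokens over the list and join each window.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = simple_tokenize("I have the best car ")
trigrams = ngram_strings(tokens, 3)
```

The trailing space in the input is ignored because the pattern only matches words and punctuation, never whitespace.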
