Hive n-gram stop word list?

Time: 2012-08-15 16:09:39

Tags: hadoop hive n-gram

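Although it is listed as one of the example use cases, I have not yet found an example of filtering junk words (and, or, etc.) out of Hive n-grams. Here is my query and its output: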

SELECT explode(context_ngrams(sentences(lower(description)), array("criminal", null), 10)) AS x FROM mapped_discussions;

{"ngram":["justice"],"estfrequency":274.0}
{"ngram":["behavior"],"estfrequency":121.0}
{"ngram":["law"],"estfrequency":92.0}
{"ngram":["activity"],"estfrequency":69.0}
{"ngram":["acts"],"estfrequency":41.0}
{"ngram":["procedure"],"estfrequency":35.0}
{"ngram":["and"],"estfrequency":29.0}
{"ngram":["or"],"estfrequency":27.0}
{"ngram":["case"],"estfrequency":26.0}
{"ngram":["cases"],"estfrequency":25.0}

Any ideas? Thanks!

1 answer:

Answer 0: (score: 3)
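One possible approach, given only as a rough sketch: wrap the original query in a subquery and drop rows whose word appears in a stop word list. The inline list `array("and", "or", "the", "of")` and the aliases `t` and `x` are assumptions for illustration, not something built into Hive. The larger `k` of 100 is also an assumption, chosen so that enough candidates remain after filtering to still return 10 rows.

```sql
-- Sketch only: the stop word list below is a hypothetical, hand-written example.
SELECT t.x.ngram[0]    AS word,
       t.x.estfrequency AS estfrequency
FROM (
  -- Ask for more than 10 n-grams, since stop words occupy some of the top slots.
  SELECT explode(
           context_ngrams(sentences(lower(description)),
                          array("criminal", null), 100)
         ) AS x
  FROM mapped_discussions
) t
-- Keep only rows whose single-word n-gram is not in the stop word list.
WHERE NOT array_contains(array("and", "or", "the", "of"), t.x.ngram[0])
ORDER BY estfrequency DESC
LIMIT 10;
```

For a longer list it would probably be easier to keep the stop words in their own table and exclude matches with a LEFT OUTER JOIN and an `IS NULL` check instead of an inline array.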