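I am working on a project that includes a document classification component. I need a library that can preprocess documents and convert them into a feature matrix. Is there a library available for this?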
Answer 0 (score: 2)
Here are two well-known Java libraries:
Java-based
Stanford CoreNLP - http://nlp.stanford.edu/software/classifier.shtml
GATE - http://osdir.com/ml/ai.gate.general/2007-05/msg00003.html, https://gate.ac.uk/sale/tao/splitch19.html#chap:ml
Python-based
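Answer 1 (score: 0)
I use Weka for document processing. It is sufficient for most document-processing operations, and it includes filters that convert documents into feature vectors. It can be used as follows: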
import java.io.File;
import weka.core.stemmers.SnowballStemmer;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Filter that converts string attributes into word-vector attributes
StringToWordVector filter = new StringToWordVector();

// Tokenizer for the filter; min = max = 1 produces unigrams only
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1);
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters("\\W"); // delimiters, intended to split on non-word characters

// Snowball stemmer configured for English
SnowballStemmer stemmer = new SnowballStemmer();
stemmer.setStemmer("english");

filter.setLowerCaseTokens(true);  // normalize tokens to lower case
filter.setOutputWordCounts(true); // output word counts instead of presence/absence
filter.setTFTransform(true);      // together with the IDF transform below, gives TF-IDF weighting
filter.setIDFTransform(true);
// Stop-word removal using a custom list; replace the path with your own file
filter.setStopwords(new File("C:\\Users\\hp\\Desktop\\SVM implementation\\StopWordsR4.txt"));
filter.setTokenizer(tokenizer);
filter.setStemmer(stemmer);
For more details, see this.
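To make it clearer what kind of feature matrix such a configuration is meant to produce, here is a minimal plain-Java sketch of the same idea: lowercase the text, split on non-word characters, drop stop words, then weight each term count. It is not Weka's implementation. The class name `TfIdfSketch`, its method names, and the sample documents are hypothetical, stemming is left out, and the weighting `log(1 + count) * log(N / docFreq)` is an assumption modeled on how the Weka documentation describes the TF and IDF transform options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative TF-IDF document-term matrix in plain Java (hypothetical helper, not Weka code). */
final class TfIdfSketch {

    /** Lowercases the text, splits on runs of non-word characters, and drops stop words. */
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!t.isEmpty() && !stopWords.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Maps every distinct term to a column index, in alphabetical order. */
    static Map<String, Integer> vocabulary(List<String> docs, Set<String> stopWords) {
        Set<String> terms = new TreeSet<>();
        for (String doc : docs) {
            terms.addAll(tokenize(doc, stopWords));
        }
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String term : terms) {
            vocab.put(term, vocab.size());
        }
        return vocab;
    }

    /** Returns one row per document and one column per vocabulary term. */
    static double[][] tfIdf(List<String> docs, Map<String, Integer> vocab, Set<String> stopWords) {
        int n = docs.size();
        int width = vocab.size();
        double[][] counts = new double[n][width];
        int[] docFreq = new int[width]; // number of documents containing each term
        for (int d = 0; d < n; d++) {
            for (String t : tokenize(docs.get(d), stopWords)) {
                Integer col = vocab.get(t);
                if (col == null) {
                    continue; // term not in the vocabulary
                }
                int c = col;
                if (counts[d][c] == 0) {
                    docFreq[c]++;
                }
                counts[d][c]++;
            }
        }
        double[][] matrix = new double[n][width];
        for (int d = 0; d < n; d++) {
            for (int c = 0; c < width; c++) {
                if (counts[d][c] > 0) {
                    // Assumed weighting: log(1 + count) * log(N / docFreq)
                    matrix[d][c] = Math.log(1 + counts[d][c]) * Math.log((double) n / docFreq[c]);
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        List<String> docs = Arrays.asList("The cat sat", "The dog sat down", "A cat");
        Map<String, Integer> vocab = vocabulary(docs, stopWords);
        System.out.println(vocab.keySet());
        for (double[] row : tfIdf(docs, vocab, stopWords)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With Weka itself you would not write this by hand. The configured filter is normally given the dataset's format with `filter.setInputFormat(data)` and then applied with `Filter.useFilter(data, filter)`, where `data` is an `Instances` object holding the documents in a string attribute.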