I have started using the Stanford CoreNLP package in R to do some text analysis in Spanish. So, I tried the following:


> install.packages("coreNLP")
Installing package into ‘/home/ach/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cran.rediris.es/src/contrib/coreNLP_0.4-1.tar.gz'
Content type 'application/x-gzip' length 17392 bytes (16 KB)
downloaded 16 KB

* installing *source* package ‘coreNLP’ ...
** package ‘coreNLP’ successfully unpacked and MD5 sums checked
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (coreNLP)

The downloaded source packages are in
> library(coreNLP)
> downloadCoreNLP(type="base")
trying URL 'http://nlp.stanford.edu/software//stanford-corenlp-full-2015-04-20.zip'
Content type 'application/zip' length 360824440 bytes (344.1 MB)
downloaded 344.1 MB

[1] 0
> downloadCoreNLP(type="spanish")
trying URL 'http://nlp.stanford.edu/software//stanford-spanish-corenlp-2015-01-08-models.jar'
Content type 'application/x-java-archive' length 25007256 bytes (23.8 MB)
downloaded 23.8 MB

> initCoreNLP()
Searching for resource: config.properties
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [2.3 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
Adding annotator dcoref
Adding annotator sentiment
> sInes <- "Hola padre. Acabo de llegar a casa. Tengo ganas de cenar"
> annotation <- annotateString(sInes)
> token <- getToken(annotation)
> token[token$sentence==2,c(1:4,7)]
  sentence id  token  lemma POS
4        2  1  Acabo  Acabo NNP
5        2  2     de     de NNP
6        2  3 llegar llegar NNP
7        2  4      a      a  DT
8        2  5   casa   casa  FW
9        2  6      .      .   .

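Everything seems to run fine (no errors are visible, as far as I can see), but it does not work. For example, "casa" is tagged as a foreign word (FW), which is incorrect.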




props.setProperty("tokenize.language", "es");

# update to newest version of the package

# download base library (mandatory):

# download desired language library:

# attach package

# run initCoreNLP specifying your language of choice
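The `initCoreNLP()` log in the question only mentions English models (`english-left3words`, `englishPCFG`, and so on), which would explain the wrong tags: the Spanish jar was downloaded but never selected. The steps in the comments above might be filled in roughly as follows. This is a sketch under assumptions, not a verified recipe: it assumes an installed coreNLP release whose `initCoreNLP()` accepts a `type` argument (the 0.4-1 session shown in the question may predate it), and that updating from CRAN is enough to get such a release.

```r
# Sketch only: the calls below download several hundred MB and need a working Java/rJava setup.

# update to newest version of the package
# (assumption: the CRAN release is recent enough to support initCoreNLP(type = ...))
install.packages("coreNLP")

# attach package
library(coreNLP)

# download base library (mandatory)
downloadCoreNLP(type = "base")

# download desired language library
downloadCoreNLP(type = "spanish")

# run initCoreNLP specifying your language of choice
# (assumption: `type` is available in the installed version; check ?initCoreNLP)
initCoreNLP(type = "spanish")

# re-run the example from the question and inspect the second sentence
sInes <- "Hola padre. Acabo de llegar a casa. Tengo ganas de cenar"
annotation <- annotateString(sInes)
token <- getToken(annotation)
token[token$sentence == 2, ]
```

If the Spanish models are picked up, the startup log should mention Spanish model files instead of the English ones, and the POS column should no longer use the English tag set. The columns returned by `getToken()` may differ from the English run, since the Spanish pipeline does not necessarily run the same annotators.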