Stanford CoreNLP for R: Spanish does not work

Time: 2016-06-17 14:51:32

Tags: r stanford-nlp

I have started using the Stanford CoreNLP package in R to do some text analysis in Spanish. This is what I tried:

R

R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> install.packages("coreNLP")
Installing package into ‘/home/ach/R/x86_64-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cran.rediris.es/src/contrib/coreNLP_0.4-1.tar.gz'
Content type 'application/x-gzip' length 17392 bytes (16 KB)
==================================================
downloaded 16 KB

* installing *source* package ‘coreNLP’ ...
** package ‘coreNLP’ successfully unpacked and MD5 sums checked
** R
** data
*** moving datasets to lazyload DB
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (coreNLP)

The downloaded source packages are in
    ‘/tmp/RtmpO3q77z/downloaded_packages’
> library(coreNLP)
> downloadCoreNLP(type="base")
trying URL 'http://nlp.stanford.edu/software//stanford-corenlp-full-2015-04-20.zip'
Content type 'application/zip' length 360824440 bytes (344.1 MB)
==================================================
downloaded 344.1 MB

[1] 0
> 
> downloadCoreNLP(type="spanish")
trying URL 'http://nlp.stanford.edu/software//stanford-spanish-corenlp-2015-01-08-models.jar'
Content type 'application/x-java-archive' length 25007256 bytes (23.8 MB)
==================================================
downloaded 23.8 MB

> initCoreNLP()
Searching for resource: config.properties
Adding annotator tokenize
TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.7 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [3.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.2 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [2.3 sec].
Initializing JollyDayHoliday for SUTime from classpath: edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/defs.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.sutime.txt
Reading TokensRegex rules from edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.4 sec].
Adding annotator dcoref
Adding annotator sentiment
> sInes <- "Hola padre. Acabo de llegar a casa. Tengo ganas de cenar"
> annotation <- annotateString(sInes)
> token <- getToken(annotation)
> token[token$sentence==2,c(1:4,7)]
  sentence id  token  lemma POS
4        2  1  Acabo  Acabo NNP
5        2  2     de     de NNP
6        2  3 llegar llegar NNP
7        2  4      a      a  DT
8        2  5   casa   casa  FW
9        2  6      .      .   .

Everything appears to run fine (no errors that I can see), but it does not work. For example, "casa" is tagged as a foreign word (FW), which is incorrect.

Does anyone have any idea what is going on?

Thank you very much,

Agustin

2 answers:

Answer 0: (score: 3)

Downloading the Spanish models is not enough; you also need to set the tokenizer to Spanish:

props.setProperty("tokenize.language", "es");
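
The line above assumes an existing `java.util.Properties` object named `props`. As a rough sketch of that context, the snippet below builds such an object using only the JDK, so it runs without CoreNLP installed. The class name `SpanishProps`, the helper `spanishProps()`, and the annotator list are illustrative assumptions, not part of the original answer. In a real program the object would be handed to the `StanfordCoreNLP` constructor, which requires the CoreNLP jar and the Spanish models jar on the classpath; that step appears only as a comment here. The POS tagger will probably also need to be pointed at a Spanish model (the `pos.model` property), but the model path is not given in this thread, so it is left out.

```java
import java.util.Properties;

// Hypothetical helper class, for illustration only.
class SpanishProps {

    // Builds a configuration that asks for the Spanish tokenizer.
    static Properties spanishProps() {
        Properties props = new Properties();
        // Assumed annotator list; adjust it to the annotators you need.
        props.setProperty("annotators", "tokenize, ssplit, pos");
        // The setting from the answer above: tokenize as Spanish.
        props.setProperty("tokenize.language", "es");
        return props;
    }

    public static void main(String[] args) {
        Properties props = spanishProps();
        // With CoreNLP and the Spanish models on the classpath,
        // the object would be passed on like this:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        System.out.println(props.getProperty("tokenize.language"));
    }
}
```

Running `main` prints `es`, which only confirms that the property is set; whether the resulting tags are correct still depends on the Spanish models being loaded by the pipeline.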

Answer 1: (score: 2)

The package author recently released an update that makes changing the language setting straightforward.

# update to newest version of the package
devtools::install_github("statsmaths/coreNLP")

# download base library (mandatory):
coreNLP::downloadCoreNLP()

# download desired language library:
coreNLP::downloadCoreNLP(type="spanish")

# attach package
library(coreNLP)

# run initCoreNLP specifying your language of choice
initCoreNLP(type="spanish")