CoreNLP: Why does the output contain only individual letters instead of individual words?

Date: 2018-10-29 03:00:13

Tags: stanford-nlp tokenize

I'm a beginner at programming, and I'm trying to use CoreNLP to tokenize movie subtitles.

I extracted all the sentences from the subtitle file and saved them to a .txt file that looks like this:

It's another hot and sunny
The temperature in downtown
And at night will drop to...
I think about that day
I left him at the Greyhound station
West of Santa Fe
We were 17 but he was sweet and it was true
Still I did what I had to do
Cause I just knew
...

The command I ran is:

java -cp "*" -Xmx500m edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -outputFormat text -file La.La.Land.2016.DVDScr.XVID.AC3.HQ.Hive-CM8.txt

I expected the output to be individual tokens, but instead it contains only individual letters, like this:

[Text=I CharacterOffsetBegin=2 CharacterOffsetEnd=3]
[Text=t CharacterOffsetBegin=4 CharacterOffsetEnd=5]
[Text=' CharacterOffsetBegin=6 CharacterOffsetEnd=7]
[Text=s CharacterOffsetBegin=8 CharacterOffsetEnd=9]
[Text=a CharacterOffsetBegin=12 CharacterOffsetEnd=13]
[Text=n CharacterOffsetBegin=14 CharacterOffsetEnd=15]
[Text=o CharacterOffsetBegin=16 CharacterOffsetEnd=17]
...

Since I'm fairly new to both programming and CoreNLP, I can't figure out how to fix this. The other example file, input.txt, seems to work fine.

Any help would be greatly appreciated!
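
A possible lead, offered as an assumption rather than a confirmed diagnosis: in the output above every letter sits two character offsets after the previous one (2, 4, 6, 8, ...), and the first two offsets are skipped. That pattern is consistent with a subtitle file saved as UTF-16 with a byte order mark being read as if it were UTF-8, so each letter is followed by a NUL character. The Python sketch below checks for a byte order mark and re-encodes the text as UTF-8. The helper names `guess_encoding` and `to_utf8` are hypothetical, and the check only recognizes files that actually start with a BOM.

```python
import codecs
from typing import Optional

# BOM signatures, longest first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def to_utf8(data: bytes) -> bytes:
    """Decode using the BOM-detected codec (assuming UTF-8 if no BOM is
    present) and re-encode as UTF-8 without a BOM."""
    encoding = guess_encoding(data) or "utf-8"
    return data.decode(encoding).encode("utf-8")


if __name__ == "__main__":
    import sys

    # Usage: python fix_encoding.py subtitles.txt subtitles.utf8.txt
    src, dst = sys.argv[1], sys.argv[2]
    with open(src, "rb") as f:
        raw = f.read()
    print("detected:", guess_encoding(raw))
    with open(dst, "wb") as f:
        f.write(to_utf8(raw))
```

If the script reports utf-16, running the same java command on the converted file should show whether the encoding was the cause. CoreNLP also documents an encoding property that defaults to UTF-8, so adding -encoding UTF-16 to the original command may work as well; that is untested against this file.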

0 answers:

No answers yet