Question

I am a programming beginner, and I am trying to tokenize movie subtitles with CoreNLP.

I extracted all the sentences from the subtitle file and saved them to a .txt file that looks like this:

It's another hot and sunny
The temperature in downtown
And at night will drop to...
I think about that day
I left him at the Greyhound station
West of Santa Fe
We were 17 but he was sweet and it was true
Still I did what I had to do
Cause I just knew
...

The command I ran is:

java -cp "*" -Xmx500m edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -outputFormat text -file La.La.Land.2016.DVDScr.XVID.AC3.HQ.Hive-CM8.txt

I expected the output to be individual word tokens, but it contains only individual letters, like this:

[Text=I CharacterOffsetBegin=2 CharacterOffsetEnd=3]
[Text=t CharacterOffsetBegin=4 CharacterOffsetEnd=5]
[Text=' CharacterOffsetBegin=6 CharacterOffsetEnd=7]
[Text=s CharacterOffsetBegin=8 CharacterOffsetEnd=9]
[Text=a CharacterOffsetBegin=12 CharacterOffsetEnd=13]
[Text=n CharacterOffsetBegin=14 CharacterOffsetEnd=15]
[Text=o CharacterOffsetBegin=16 CharacterOffsetEnd=17]
...

Since I am fairly new to both programming and CoreNLP, I cannot find a way to fix this. Other sample input.txt files seem to work fine.
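One thing worth checking, offered as a guess and not a confirmed diagnosis: the character offsets in the output above advance by two per letter (2, 4, 6, 8, then 12 after the space). That is the pattern one would expect if the .txt file were saved as UTF-16 with a byte-order mark and then read as a single-byte-compatible encoding, so that every letter is separated by a NUL byte. The sketch below is a minimal Python check under that assumption. The function names `looks_like_utf16`, `to_utf8`, and `convert_file` are hypothetical helpers written for this illustration, and the 30% NUL-byte threshold is an arbitrary heuristic.

```python
import codecs
from pathlib import Path


def looks_like_utf16(raw: bytes) -> bool:
    """Heuristic: a UTF-16 byte-order mark, or NUL bytes in a large share of positions."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    if not raw:
        return False
    # English text stored as UTF-16 is roughly half NUL bytes; 0.3 is an arbitrary cutoff.
    return raw.count(b"\x00") / len(raw) > 0.3


def to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that look like UTF-16 as UTF-8; return other input unchanged."""
    if not looks_like_utf16(raw):
        return raw
    # With a BOM, the "utf-16" codec picks the byte order from it and drops the BOM.
    # Without a BOM it assumes the platform's native byte order, which may be wrong.
    return raw.decode("utf-16").encode("utf-8")


def convert_file(src: str, dst: str) -> None:
    """Write a UTF-8 copy of src to dst (both paths are supplied by the caller)."""
    Path(dst).write_bytes(to_utf8(Path(src).read_bytes()))
```

If `looks_like_utf16` returns True for the subtitle file, running `convert_file` on it and passing the UTF-8 copy to the same CoreNLP command would be a reasonable next step. If it returns False, the cause lies elsewhere and this sketch does not apply.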

Any help would be greatly appreciated!

CoreNLP: Why does the output contain only individual letters instead of individual words?

0 answers: