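I'm a beginner at programming, and I'm trying to use CoreNLP to tokenize movie subtitles.
I extracted all the sentences from the subtitle file into a .txt file that looks like this: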
It's another hot and sunny
The temperature in downtown
And at night will drop to...
I think about that day
I left him at the Greyhound station
West of Santa Fe
We were 17 but he was sweet and it was true
Still I did what I had to do
Cause I just knew
...
The command I ran is:
java -cp "*" -Xmx500m edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -outputFormat text -file La.La.Land.2016.DVDScr.XVID.AC3.HQ.Hive-CM8.txt
I expected the output to be individual word tokens, but instead it contains only single letters, like this:
[Text=I CharacterOffsetBegin=2 CharacterOffsetEnd=3]
[Text=t CharacterOffsetBegin=4 CharacterOffsetEnd=5]
[Text=' CharacterOffsetBegin=6 CharacterOffsetEnd=7]
[Text=s CharacterOffsetBegin=8 CharacterOffsetEnd=9]
[Text=a CharacterOffsetBegin=12 CharacterOffsetEnd=13]
[Text=n CharacterOffsetBegin=14 CharacterOffsetEnd=15]
[Text=o CharacterOffsetBegin=16 CharacterOffsetEnd=17]
...
Since I'm still fairly new to both programming and CoreNLP, I can't figure out how to fix this. The other example file, input.txt, seems to work fine.
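One thing I did notice: the character offsets start at 2 and then advance by two per letter (2, 4, 6, 8, then 12 after the space). That would be consistent with the file being saved in a two-byte encoding such as UTF-16 with a byte order mark, rather than UTF-8, but this is only my guess. Below is a minimal Python sketch I could use to check it. The function name `guess_encoding` and the 25% NUL-byte threshold are my own choices for illustration, not part of CoreNLP, and the result is a heuristic rather than a definitive detection.

```python
import codecs


def guess_encoding(head: bytes) -> str:
    """Heuristically guess a text encoding from the first bytes of a file."""
    # Check the 4-byte UTF-32 BOMs first, because the UTF-32 LE BOM
    # begins with the same two bytes as the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: English text stored as UTF-16 has a NUL byte next to almost
    # every letter, so a high share of NUL bytes hints at a two-byte encoding.
    # The 25% threshold is an arbitrary assumption.
    if head and head.count(b"\x00") >= len(head) // 4:
        return "utf-16 (guessed, no BOM)"
    return "no BOM or NUL bytes found (possibly utf-8 or ascii)"


# Demo on in-memory bytes instead of the real subtitle file.
sample = "It's another hot and sunny"
print(guess_encoding(sample.encode("utf-16")))  # encoded with a BOM
print(guess_encoding(sample.encode("utf-8")))   # plain single-byte text
```

To run it on the real file, I would read the first few bytes in binary mode, for example `guess_encoding(open(path, "rb").read(64))`, where `path` is the subtitle .txt file. If it reports UTF-16, re-saving the file as UTF-8 before running CoreNLP might be worth trying.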
Any help would be greatly appreciated!