Question

Using PTBTokenizer on the following input text:

hey, how are you? SENT_SEP I'm doing well, thanks!
the quick, brown fox. SENT_SEP Jumps over the lazy dog.

With the -preserveLines option, each sentence is correctly tokenized into words while the line structure is kept:

hey , how are you ? SENT_SEP I 'm doing well , thanks !
the quick , brown fox . SENT_SEP Jumps over the lazy dog .

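Is there a similar option for splitting documents into sentences with DocumentPreprocessor (for example, using a delimiter other than \n to separate sentences)? My input is a file with one document per line: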

hey, how are you? I'm doing well, thanks!
the quick, brown fox. Jumps over the lazy dog.

But if I apply DocumentPreprocessor to it, I get:

hey , how are you ?
I 'm doing well , thanks !
the quick , brown fox .
Jumps over the lazy dog .

This breaks the one line = one document correspondence!

I tried the tokenizePerLine and tokenizeNLs options, without success.

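Since my input file has tens of millions of lines, writing each line to its own .txt file and applying DocumentPreprocessor to each file separately would be very inefficient.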
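
To make the behaviour I am after concrete, here is a minimal sketch that streams the input line by line, splits each line into sentences in memory, and joins them with a delimiter so that one output line still equals one document. It does not use DocumentPreprocessor: the class name `LineDocs`, its methods, and the naive regex splitter are all hypothetical stand-ins for illustration, and the ` SENT_SEP ` delimiter is simply the one from my example above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Stanford CoreNLP.
class LineDocs {

    // Naive stand-in for a real sentence splitter (assumption):
    // breaks after '.', '!' or '?' when followed by whitespace.
    static List<String> naiveSentences(String doc) {
        return Arrays.asList(doc.trim().split("(?<=[.!?])\\s+"));
    }

    // Splits one document (one input line) and joins its sentences with sep.
    static String splitLine(String line,
                            Function<String, List<String>> splitter,
                            String sep) {
        return String.join(sep, splitter.apply(line));
    }

    // Streams the input so that no per-line temporary files are needed;
    // each input line produces exactly one output line.
    static void process(Reader in, Writer out,
                        Function<String, List<String>> splitter,
                        String sep) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                out.write(splitLine(line, splitter, sep));
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The splitter is passed in as a function so that the regex could be swapped for a real sentence splitter applied to each line; whether DocumentPreprocessor can do this directly, without such a wrapper, is exactly what I am asking.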

Related, but I am wondering whether there is a simpler solution than the one suggested there: How to Preserve Original Line Numbering in the Output of Stanford CoreNLP?