Question

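How can I format text so that line breaks respect natural-language punctuation? Vim's built-in gq command and command-line tools such as fmt or par break lines without regard to punctuation. Let me give an example.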

fmt -w 40 does not give me what I want:

we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way

smart_formatter -w 40 would give:

we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way

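Of course, there are cases where no punctuation mark is found within the given text width; the formatter can then fall back to the standard text formatting behavior.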

The reason I want this is to get meaningful diffs of text, in which I can spot which sentence or clause has changed.

Answer 1

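This is a not very elegant but working method that I finally came up with. Suppose that a line break at a punctuation mark is worth 6 characters. That means I will accept a more ragged result, as long as the raggedness is less than 6 characters, if it contains more lines that end in punctuation. For example, this is OK (the raggedness is 3 characters: "Wait!" is 5 characters long and "He said." is 8):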

Wait!
He said.

This is not OK (the raggedness is more than 6 characters):

Wait!
He said to them.

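The method is to add 6 dummy characters after each punctuation mark, format the text, and then remove the dummy characters.

Here is the code for this: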

sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'
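To see what the first and last stages of the pipeline do without fmt in the middle, they can be run on their own. This is only a small sketch: the function names pad and unpad are hypothetical and are not part of the original command.

```shell
# Stage 1 (hypothetical name "pad"): append three " _" pairs,
# i.e. 6 dummy characters, after each of . ? ! ,
pad() { sed -e 's/\([.?!,]\)/\1 _ _ _/g'; }

# Stage 3 (hypothetical name "unpad"): strip the dummy pairs again.
# The second expression catches pairs that begin a line after reflowing.
unpad() { sed -e 's/ _//g' -e 's/_ //g'; }

printf 'Wait! He said.\n' | pad
# prints: Wait! _ _ _ He said. _ _ _

printf 'Wait! He said.\n' | pad | unpad
# prints: Wait! He said.
```

Because fmt sees the padding as ordinary words, a line that would otherwise end just after a punctuation mark looks 6 characters longer to it, which is what nudges the breaks toward punctuation.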

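I use " _" (a space followed by an underscore) as a pair of dummy characters, assuming that they do not occur in the text. The result looks quite good: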

we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
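
For repeated use, the same pipeline could be wrapped in a shell function. This is a minimal sketch under a few assumptions: the name punct_fmt is made up for illustration, the width defaults to the 34 used above, and the input text contains no underscores.

```shell
# punct_fmt [WIDTH]: hypothetical wrapper around the pipeline above.
# Reads text on stdin and writes the reflowed text to stdout.
punct_fmt() {
    width=${1:-34}                            # default taken from the example above
    sed -e 's/\([.?!,]\)/\1 _ _ _/g' |        # pad punctuation with 6 dummy characters
        fmt -w "$width" |                     # let fmt reflow the padded text
        sed -e 's/ _//g' -e 's/_ //g'         # remove the dummy characters
}

# Example run on the sentence from the question.
printf '%s\n' 'we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way' |
    punct_fmt 34
```

It reads standard input, so it can be used as a filter, for example punct_fmt 34 < notes.txt (a placeholder file name), or from Vim as an external filter on a range of lines.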