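How do I format natural-language text so that lines break at punctuation? Vim's built-in gq
command and command-line tools such as fmt or par break lines without regard to punctuation. For example,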
fmt -w 40
does not give me what I want:
we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way
smart_formatter -w 40
would give:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
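Of course, there will be cases where no punctuation falls within the given text width; the formatter can then fall back to standard text-formatting behavior.
The reason I want this is to get meaningful diffs of the text, in which I can spot which sentence or clause has changed.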
Answer 0 (score: 0)
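Here is a not very elegant but working method that I eventually came up with. Suppose a line break at a punctuation mark is worth 6 characters. That is, I will accept a more ragged result that has more lines ending in punctuation, as long as the "raggedness" is less than 6 characters. For example, this is fine (the raggedness is 3 characters):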
Wait!
He said.
This is not (the raggedness is more than 6 characters):
Wait!
He said to them.
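To make the arithmetic concrete, here is a small sketch that measures raggedness as the difference between the longest and the shortest line. That definition is an assumption read off the two examples above (8 - 5 = 3 and 16 - 5 = 11), and the helper name raggedness is hypothetical.

```shell
# raggedness: hypothetical helper, prints (longest line length) - (shortest line length)
raggedness() {
    awk '
        { n = length($0) }
        NR == 1 || n > max { max = n }
        NR == 1 || n < min { min = n }
        END { print max - min }
    '
}

printf '%s\n' 'Wait!' 'He said.' | raggedness          # prints 3
printf '%s\n' 'Wait!' 'He said to them.' | raggedness  # prints 11
```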
The method is to append 6 dummy characters after every punctuation mark, format the text, and then remove the dummy characters.
Here is the code for this:
sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'
I use " _" (space + underscore) as the dummy pair, assuming it does not occur in the text. The result looks quite good:
we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
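In the output above, the third line still breaks after "to" rather than after "Heaven,". For comparison, here is a minimal sketch of a different technique (not the padding trick from this answer): a greedy filler written in awk that, whenever a line overflows, breaks after the last word ending in punctuation that still fits, and otherwise falls back to ordinary filling. The function name smart_fmt is hypothetical, the punctuation set [.?!,] is taken from the sed expression above, and it is assumed that paragraphs may be merged into one and that a remainder which fits within the width is printed as a single line.

```shell
# smart_fmt [WIDTH]: hypothetical helper, greedy fill that prefers breaks at punctuation
smart_fmt() {
    awk -v width="${1:-40}" '
    { for (k = 1; k <= NF; k++) words[++n] = $k }      # collect every word of the input
    END {
        i = 1
        while (i <= n) {
            len = 0; last_fit = i; last_punct = 0
            for (j = i; j <= n; j++) {
                len += length(words[j]) + (j > i ? 1 : 0)
                if (len > width && j > i) break         # word j no longer fits
                last_fit = j
                if (words[j] ~ /[.?!,]$/) last_punct = j
            }
            if (j > n) stop = n                         # the rest fits: print it as is
            else if (last_punct) stop = last_punct      # overflow: break at punctuation
            else stop = last_fit                        # no punctuation: plain fill
            line = words[i]
            for (k = i + 1; k <= stop; k++) line = line " " words[k]
            print line
            i = stop + 1
        }
    }'
}

printf '%s\n' 'we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way' | smart_fmt 40
```

With a width of 40 this produces the four-line layout asked for in the question. Unlike the padding trick, it has no tunable trade-off between raggedness and punctuation breaks: it always prefers the punctuation break when a line overflows.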