Formatting text with respect to punctuation

Time: 2011-05-16 19:48:06

Tags: text formatting diff nlp

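How can I format natural-language text with respect to punctuation? Vim's built-in gq command and command-line tools such as fmt or par break lines without regard to punctuation. Here is an example.

fmt -w 40 does not give what I want: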

we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way

smart_formatter -w 40 would give:

we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way

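Of course, there are cases where no punctuation mark is found within the given text width; the formatter can then fall back to standard text-formatting behavior.

The reason I want this is to get meaningful diffs of text, in which I can spot which sentence or clause has changed.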

1 Answer:

Answer 0 (score: 0)

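Here is a not very elegant but working method that I eventually came up with. Suppose a line break at a punctuation mark is worth 6 characters. That means I will accept a more ragged result that has more lines ending in punctuation, as long as the "raggedness" is less than 6 characters. For example, this is OK (the "raggedness" is 3 characters):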

Wait!
He said.

This is not OK (the "raggedness" is more than 6 characters):

Wait!
He said to them.

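The method is to add 6 dummy characters after each punctuation mark, format the text, and then remove the dummy characters.

Here is the code for this:
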
sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'

I use " _" (space + underscore) as a pair of dummy characters, assuming that it does not occur in the text. The result looks pretty good:

we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
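
For repeated use, the sed/fmt pipeline from this answer could be wrapped in a small shell function. The following is only a sketch: the name punct_fmt, the default width of 34, and the fallback to plain fmt when the input already contains an underscore are my own assumptions, not part of the original method. The fallback is there because the cleanup step would otherwise delete real underscores from the text.

```shell
# punct_fmt [WIDTH]: reflow stdin, preferring line breaks after punctuation.
# The name punct_fmt and the plain-fmt fallback are illustrative assumptions.
punct_fmt() {
    width=${1:-34}          # width handed to fmt, as in the pipeline above
    text=$(cat)             # read all of stdin so it can be inspected first

    # The trick relies on "_" not occurring in the input; otherwise the
    # cleanup step would remove real underscores. Fall back to plain fmt.
    case $text in
        *_*) printf '%s\n' "$text" | fmt -w "$width"; return ;;
    esac

    printf '%s\n' "$text" |
        sed -e 's/\([.?!,]\)/\1 _ _ _/g' |    # pad each mark with 6 dummy chars
        fmt -w "$width" |                     # let fmt choose the breaks
        sed -e 's/ _//g' -e 's/_ //g'         # strip the padding again
}
```

It reads from standard input, so something like punct_fmt 34 < chapter.txt would reflow a file (chapter.txt is a placeholder name). A punctuation mark inside a token, such as the dot in 3.14, is padded as well and may leave a stray underscore if fmt breaks the line there, so this is best suited to ordinary prose.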