Formatting text for version control

Date: 2016-05-19 15:35:30

Tags: regex awk sed

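Many of my documents are written in LaTeX, which lends itself to distributed workflows and version control if the source is formatted properly. Specifically, I like to format my text with one sentence per line.

My problem is that I have some legacy documents to convert that do not follow this formatting policy, and I would like to convert them automatically. I thought some combination of sed and/or awk would make this simple, but I am having some trouble.

I am trying to convert the first block of text below into the second: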

This is some unformatted
text that does not have a sentence on one line.

This is a new unformatted paragraph
that does not follow the rule either.

This line \\ has a break in it.

This is some unformatted text that does not have a sentence on one line.

This is a new unformatted paragraph that does not follow the rule either.

This line \\
has a break in it.

The sed/awk I have so far is:

awk ' /^$/ { print "\n"; } /./ { printf("%s", $0); } END { print; } ' <filename> | sed -e $'s/\. /\.\\\n/g'

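This gets me most of the way there, but I cannot get the \\ followed by a newline character to work properly.

Any help would be much appreciated.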

2 answers:

Answer 0 (score: 1)

Input

$ cat text
This is some unformatted
text that does not have a sentence on one line.

This is a new unformatted paragraph
that does not follow the rule either.

This line \\ has a break in it.

This line too \\ contains break.
This is a normal line.

Script

 $ awk 'BEGIN{RS=".";}
 {$0=gensub(/([[:print:]?])\n/,"\\1 ","g");
 $0=gensub(/(\\\\) /,"\\1\n","g");
 printf "%s.",$0}
 END{printf "\n"}' text

Output

This is some unformatted text that does not have a sentence on one line.

This is a new unformatted paragraph that does not follow the rule either.

This line \\
has a break in it.

This line too \\
contains break.
This is a normal line .

Note: this assumes you have GNU awk (gawk), because gensub() is a gawk extension.
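If gawk is not available, the same result can probably be reached without gensub(). The following is only a sketch under that assumption, not the method used above: it buffers each paragraph line by line instead of setting RS, and it locates the \\ with index() and substr() so that no backslashes appear in a replacement string. The names reflow_posix and emit_par are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: join each paragraph onto one line using plain awk
# (no gensub), then start a new line after every "\\".
reflow_posix() {
    awk '
    # Print the buffered paragraph, breaking the line after each "\\".
    function emit_par(   i, bs) {
        bs = "\\\\"                        # two literal backslashes
        while ((i = index(buf, bs)) > 0) {
            print substr(buf, 1, i + 1)    # text up to and including "\\"
            buf = substr(buf, i + 2)
            sub(/^[ \t]+/, "", buf)        # drop blanks that followed "\\"
        }
        if (buf != "") print buf
        buf = ""
    }
    /^[ \t]*$/ { emit_par(); print ""; next }   # a blank line ends a paragraph
    { buf = (buf == "" ? $0 : buf " " $0) }     # join lines with one space
    END { emit_par() }
    ' "$@"
}
```

It can be called as `reflow_posix text`, or fed from a pipe. Unlike the RS="." version, it does not depend on where the periods fall, so a paragraph that does not end in a period is still joined.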

Answer 1 (score: 1)

$ awk -v RS= -v ORS='\n\n' -F'\\\\\\\\[[:space:]]*' -v OFS='\n' '{gsub(/\n/," "); $1=$1}1' file
This is some unformatted text that does not have a sentence on one line.

This is a new unformatted paragraph that does not follow the rule either.

This line
has a break in it.
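Note that in the output above the \\ itself is gone, because it is part of the field separator and is replaced by OFS. If the \\ has to survive, one possible variation (a sketch, not part of the original answer) is to put the two backslashes back through OFS. Here FS and OFS are set in a BEGIN block rather than on the command line, and `[ \t]*` stands in for `[[:space:]]*` on the assumption that some older awks lack POSIX character classes. The name join_keep_breaks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: paragraph-mode join that re-inserts the "\\"
# consumed by the field separator.
join_keep_breaks() {
    awk '
    BEGIN {
        RS  = ""                    # paragraph mode: blank lines separate records
        ORS = "\n\n"                # one blank line after each paragraph
        FS  = "\\\\\\\\[ \t]*"      # regex: two backslashes plus trailing blanks
        OFS = "\\\\\n"              # output: two backslashes, then a newline
    }
    # Join the lines of the paragraph, then rebuild it so OFS replaces FS.
    { gsub(/\n/, " "); $1 = $1; print }
    ' "$@"
}
```

Called as `join_keep_breaks file`, this should turn `This line \\ has a break in it.` into `This line \\` followed by `has a break in it.` on the next line, which matches the output requested in the question.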