Using sed to remove blank lines from the first n lines

Date: 2015-03-17 18:55:40

Tags: regex sed

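I need to remove the blank lines from the first 6 lines of a text file. I tried to piece together a solution from this StackOverflow question and this file, but to no avail.

Here is the sed script I'm using (aliased as faprep='~/misc-scripts/fa-prep.sed'); the last command is the one that fails: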

#!/opt/local/bin/sed -f

# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g    # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g    # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g             # Strip <h3 id=""></h3> out without removing chapter title text

# HTML tag strips & substitutions
s|</\?p>||g                 # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g       # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g   # Change <strong></strong> to [b][/b]

# Character code substitutions
s/&\#822[01];/\"/g  # Replace &#8220; and &#8221; with straight double quote (")
s/&\#8217;/\'/g     # Replace &#8217; with straight single quote (')
s/&\#8230;/.../g    # Replace &#8230; with a 3-period ellipsis (...)
s/&\#821[12];/--/g  # Replace &#8211; (en dash) and &#8212; (em dash) with two hyphens (--)

# Final prep; stripping out unnecessary cruft
/<body>/,/<\/body>/!d   # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d           # Then, delete the body tags :3

# Pay attention to meeeeeeee!!!!
1,6{/./!d}      # Remove blank lines from around titles??

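Here is the command I run from the terminal. It shows that the last command fails to remove the blank lines from the first 6 lines of the file (after all the other modifications have been made, of course):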

calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt

[b]Hoenn Saga (S1)[/b]

[i]Next City Arc (A2)[/i]

Chapter 6: A Peaceful City Stroll... Or Not
calyodelphi@dragonpad:~/pokemon-story/compilations $

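The rest of the file consists of a blank line after the third title, followed by all the paragraphs, each separated by a blank line. I want to keep those blank lines, so that only the blank lines between the titles at the top are stripped.

Just to clarify a few points: the file has Unix line endings, and the blank lines should contain no whitespace. Even when viewed in a text editor that displays whitespace, each blank line contains nothing but a newline character.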
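For readers who want to check the same thing from the shell, here is a minimal sketch. The sample text is hypothetical (it is not the asker's file); it only illustrates how sed's `l` command and `grep` reveal stray whitespace or CRLF endings that would stop `/./!d` or `/^$/d` from matching.

```shell
# Hypothetical sample: a title, a truly empty line, a line holding one space,
# and a line with a DOS (CRLF) ending.
sample=$(printf 'Title\n\n \nDOS line\r\n')

# `l` prints each line unambiguously: `$` marks the end of the line and
# non-printing characters appear as escapes such as \r.
visible=$(printf '%s\n' "$sample" | sed -n 'l')
printf '%s\n' "$visible"

# Count the lines that are completely empty, i.e. the only ones that
# /./!d or /^$/d would delete.
empty_count=$(printf '%s\n' "$sample" | grep -c '^$')
printf 'empty lines: %s\n' "$empty_count"
```

Only the second line counts as empty here; the line holding a space and the line ending in a carriage return would both survive a blank-line deletion.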

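2 Answers:

Answer 0 (score: 0)

The discussion in the comments made it clear that you want to drop the blank lines among the first six lines inside the body tags (in other words, the first six times that part of the script is reached) rather than among the first six lines of the total input, so the global line counter cannot be used. Since you are not using the hold buffer, we can use it to build our own counter.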

So, replace

1,6 { /./! d }

with

x               # swap in hold buffer
/.\{6\}/! {     # if the counter in it hasn't reached 6
  s/^/./        # increment by one (i.e., append a character)
  x             # swap the input back in
  /./!d         # if it is empty, discard it
  x             # otherwise swap back
}
x               # and swap back one more time. This dance ensures that the
                # line from the input is in the pattern space when we drop
                # out at the bottom to the printing, regardless of which
                # branches were entered.
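To try the counter in isolation, here is a small sketch. The input lines are made up for illustration and stand in for the text that survives the body-stripping commands; the sed commands are the same as above, stored in a shell variable.

```shell
# The hold-space counter from the answer, unchanged apart from the comments.
counter_script='
x
/.\{6\}/!{
  s/^/./
  x
  /./!d
  x
}
x
'

# Hypothetical input: titles separated by blank lines, then paragraphs.
sample=$(printf '%s\n' '' '[b]Saga[/b]' '' '[i]Arc[/i]' '' 'Chapter 6' \
  '' 'First paragraph' '' 'Second paragraph')

# Blank lines among the first six lines are dropped; later ones are kept.
result=$(printf '%s\n' "$sample" | sed "$counter_script")
printf '%s\n' "$result"
```

Note that the counter advances for every line that reaches it, blank or not, so it counts "the first six lines seen", not "the first six non-blank lines".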

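Alternatively, if that seems too complicated, take @glennjackman's suggestion and pipe the output of the first sed script through sed '1,6 { /./! d; }', since the second process has its own line counter that runs over the preprocessed data. It is less fun, but it will work.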
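The difference between the two line counters can be seen with a small experiment. This is only a sketch: the HTML is a hypothetical miniature of the real input, the range is shortened to 1,4 to fit it, and the body-stripping commands are written in a portable form rather than copied from the question.

```shell
# Hypothetical miniature of the HTML input: four header lines precede <body>.
input=$(printf '%s\n' '<html>' '<head>' '<title>t</title>' '</head>' \
  '<body>' '' 'Title' '' 'Subtitle' '' 'Text' '</body>' '</html>')

# Keep only what lies between the body tags.
strip_body() {
  sed -e '/<body>/,/<\/body>/!d' -e '/<\/*body>/d'
}

# One process: 1,4 refers to INPUT lines 1-4, which were already deleted
# before reaching the last command, so no blank line is removed.
one_process=$(printf '%s\n' "$input" |
  sed -e '/<body>/,/<\/body>/!d' -e '/<\/*body>/d' -e '1,4{/./!d;}')

# Two processes: the second sed numbers the already-stripped lines from 1.
two_processes=$(printf '%s\n' "$input" | strip_body | sed '1,4{/./!d;}')

printf 'one process:\n%s\n---\ntwo processes:\n%s\n' \
  "$one_process" "$two_processes"
```

In the single-process run every blank line survives, including the one before the first title. In the piped run the blank lines around the titles are gone and the one before the text is kept.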

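Answer 1 (score: 0)

This answer comes from a comment that @Wintermute left on my question, which pointed me in the right direction! By putting the delete statement last, I had wrongly assumed that sed modifies the stream as it goes, so that line numbers would refer to the lines still remaining. When I tried a different address (lines 9,14), it worked fine, but that was too clunky for my taste. It did confirm, though, that I needed to think of the stream as still including the lines I thought were gone.

So I moved the delete statement above the statements that strip the <body> tags and everything outside them, and combined a regex address with the addr1,+N trick to produce the final result:

Script: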

#!/opt/local/bin/sed -f

# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g    # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g    # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g             # Strip <h3 id=""></h3> out without removing chapter title text

# HTML tag strips & substitutions
s|</\?p>||g                 # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g       # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g   # Change <strong></strong> to [b][/b]

# Character code substitutions
s/&\#822[01];/\"/g  # Replace &#8220; and &#8221; with straight double quote (")
s/&\#8217;/\'/g     # Replace &#8217; with straight single quote (')
s/&\#8230;/.../g    # Replace &#8230; with a 3-period ellipsis (...)
s/&\#821[12];/--/g  # Replace &#8211; (en dash) and &#8212; (em dash) with two hyphens (--)

# Final prep; stripping out unnecessary cruft
/<body>/,+6{/^$/d}      # Remove blank lines from around titles
/<body>/,/<\/body>/!d   # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d           # Then, delete the body tags :3

Resulting output:

calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not

The next two weeks of training passed by too quickly and too slowly at the same time. [rest of paragraph omitted for space]

calyodelphi@dragonpad:~/pokemon-story/compilations $
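For anyone who wants to reproduce this on a small scale, here is a sketch of the last three commands of the script run against a hypothetical sample (not the real chapter file). It assumes GNU sed, because both the addr1,+N address form and the \? quantifier are GNU extensions.

```shell
# Hypothetical input: three title lines separated by blank lines inside <body>,
# followed by paragraphs that should stay separated.
input=$(printf '%s\n' '<html>' '<body>' '' 'Title' '' 'Subtitle' '' 'Chapter' \
  '' 'Paragraph one' '' 'Paragraph two' '</body>' '</html>')

# /<body>/,+6 selects the <body> line and the six lines after it; blank lines
# in that window are deleted before the body tags themselves are stripped.
result=$(printf '%s\n' "$input" |
  sed -e '/<body>/,+6{/^$/d}' \
      -e '/<body>/,/<\/body>/!d' \
      -e '/<\/\?body>/d')
printf '%s\n' "$result"
```

Because the window is anchored on the <body> line instead of on an absolute line number, it does not matter how many lines precede the body in the input.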

Thanks, @Wintermute! :D