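I need to remove the blank lines from the first 6 lines of a text file. I tried to piece together a solution using this StackOverflow question and this file, but to no avail.
Here is the sed script I'm using (aliased as faprep='~/misc-scripts/fa-prep.sed'); the last command is the one that fails: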
#!/opt/local/bin/sed -f
# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g # Strip <h3 id=""></h3> out without removing chapter title text
# HTML tag strips & substitutions
s|</\?p>||g # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g # Change <strong></strong> to [b][/b]
# Character code substitutions
s/&\#822[01];/\"/g # Replace “ and ” with straight double quote (")
s/&\#8217;/\'/g # Replace ’ with straight single quote (')
s/&\#8230;/.../g # Replace … with a 3-period ellipsis (...)
s/&\#821[12];/--/g # Replace — with a 2-hyphen em dash (--)
# Final prep; stripping out unnecessary cruft
/<body>/,/<\/body>/!d # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d # Then, delete the body tags :3
# Pay attention to meeeeeeee!!!!
1,6{/./!d} # Remove blank lines from around titles??
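Here is the command I run from the terminal. It shows that the last line fails to remove the blank lines from the first 6 lines of the file (after all the other modifications have been made, of course):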
calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not
calyodelphi@dragonpad:~/pokemon-story/compilations $
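The rest of the file consists of a blank line after the third title, followed by paragraphs that are all separated by blank lines. I want to keep those blank lines, so that only the blank lines between the titles at the top are stripped.
Just to clarify a few points: the file has Unix line endings, and the blank lines should contain no whitespace. Even when viewed in a text editor that displays whitespace, each blank line contains only a newline character.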
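Answer 0 (score: 0)
The discussion in the comments made it clear that you want to discard blank lines among the first six lines inside the body tags (in other words, the first six times that part of the script is reached), not among the first six lines of the overall input. That means the global line counter cannot be used. Since you are not using the hold buffer, we can use it to build our own counter.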
So, replace
1,6 { /./! d }
with
x                 # swap in hold buffer
/.\{6\}/! {       # if the counter in it hasn't reached 6
  s/^/./          # increment by one (i.e., append a character)
  x               # swap the input back in
  /./!d           # if it is empty, discard it
  x               # otherwise swap back
}
x                 # and swap back one more time. This dance ensures that the
                  # line from the input is in the pattern space when we drop
                  # out at the bottom to the printing, regardless of which
                  # branches were entered.
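To see the counter at work, here is a minimal sketch that runs the same hold-space dance on a made-up sample. The sample text and the shell variable names `sample`, `script`, and `result` are assumptions for illustration and are not part of the original script. The body-stripping commands are placed first so that only lines inside the body ever reach the counter.

```shell
# Hypothetical sample input, not the asker's real file.
sample='<html>
<head>
<meta charset="utf-8">
<title>Demo</title>
</head>
<body>
[b]Saga[/b]

[i]Arc[/i]

Chapter 1

First paragraph.

Second paragraph.
</body>
</html>'

# Body stripping first, then the hold-space counter from the answer.
# The counter grows by one "." for each line that reaches it, up to six.
script='/<body>/,/<\/body>/!d
/<body>/d
/<\/body>/d
x
/.\{6\}/!{
s/^/./
x
/./!d
x
}
x'

result=$(printf '%s\n' "$sample" | sed "$script")
printf '%s\n' "$result"
```

In this sketch the three blank lines among the first six body lines are dropped, while the blank line between the two paragraphs survives because the counter has already reached six by then.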
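Alternatively, if this seems too complicated, use @glennjackman's suggestion and pipe the output of the first sed script through sed '1,6 { /./! d; }', because the second process has its own line counter that works on the preprocessed data. It is less fun, but it will work.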
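The difference between one process and two can be sketched as follows. The sample text, the `strip_body` helper, and the variable names are hypothetical and stand in for the real file and the full fa-prep.sed script; only the body-stripping commands and the `1,6` block are reproduced.

```shell
# Hypothetical sample input: six lines of markup precede the first title.
sample='<html>
<head>
<meta charset="utf-8">
<title>Demo</title>
</head>
<body>
[b]Saga[/b]

[i]Arc[/i]

Chapter 1

First paragraph.

Second paragraph.
</body>
</html>'

# Hypothetical stand-in for the first script: keep only what is inside <body>.
strip_body() {
  sed -e '/<body>/,/<\/body>/!d' -e '/<body>/d' -e '/<\/body>/d'
}

# One process: "1,6" refers to lines 1-6 of the ORIGINAL input, all of which
# are already deleted, so no blank line is removed.
one_pass=$(printf '%s\n' "$sample" |
  sed -e '/<body>/,/<\/body>/!d' -e '/<body>/d' -e '/<\/body>/d' -e '1,6{/./!d;}')

# Two processes: the second sed starts counting at 1 on the cleaned-up stream.
two_pass=$(printf '%s\n' "$sample" | strip_body | sed -e '1,6{/./!d;}')

printf 'one pass:\n%s\n\ntwo passes:\n%s\n' "$one_pass" "$two_pass"
```

With this sample, the single-process run leaves every blank line in place, whereas the piped run removes the blank lines that fall within the first six lines of the stripped text.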
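Answer 1 (score: 0)
This answer grew out of a comment that @Wintermute left on my question, which pointed me in the right direction! When I put the delete statement at the end, I mistakenly assumed that its line numbers would refer to the stream as sed had already modified it. When I tried a different address (lines 9,14), it worked just fine, but that felt too kludgy to me. It did confirm, though, that I need to think of the stream as still including the lines I thought were gone.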
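A quick way to check this behaviour is to ask sed which line number it thinks it is on after earlier lines have been deleted. This is only a toy sketch; the four-line input `a b c d` and the variable names are made up for illustration.

```shell
# "=" prints the current input line number before the line itself.
numbered=$(printf '%s\n' a b c d | sed -e '1,2d' -e '=')

# The surviving lines are still numbered 3 and 4, so "1s" never fires
# but "3s" does.
addressed=$(printf '%s\n' a b c d |
  sed -e '1,2d' -e '1s/^/first: /' -e '3s/^/third: /')

printf '%s\n---\n%s\n' "$numbered" "$addressed"
```

Deleting lines 1 and 2 does not renumber anything: `c` is still reported as line 3, which is why an address written for the modified text never matches.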
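So I moved the delete statement above the statements that strip out the <body> tags and everything outside them, and used a regular expression together with the addr1,+N trick to produce the final result:
Script: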
#!/opt/local/bin/sed -f
# Title Treatments
s|<\(/\?\)h1[^>]*\?>|[\1b]|g # Replace <h1></h1> with [b][/b] for saga titles
s|<\(/\?\)h2[^>]*\?>|[\1i]|g # Replace <h2></h2> with [i][/i] for arc titles
s|</\?h3[^>]*\?>||g # Strip <h3 id=""></h3> out without removing chapter title text
# HTML tag strips & substitutions
s|</\?p>||g # Strip all <p></p> tags
s|<\(/\?\)em>|[\1i]|g # Change <em></em> to [i][/i]
s|<\(/\?\)strong>|[\1b]|g # Change <strong></strong> to [b][/b]
# Character code substitutions
s/&\#822[01];/\"/g # Replace “ and ” with straight double quote (")
s/&\#8217;/\'/g # Replace ’ with straight single quote (')
s/&\#8230;/.../g # Replace … with a 3-period ellipsis (...)
s/&\#821[12];/--/g # Replace — with a 2-hyphen em dash (--)
# Final prep; stripping out unnecessary cruft
/<body>/,+6{/^$/d} # Remove blank lines from around titles
/<body>/,/<\/body>/!d # Delete everything OUTSIDE the <body></body> tags
/<\/\?body>/d # Then, delete the body tags :3
Resulting output:
calyodelphi@dragonpad:~/pokemon-story/compilations $ ch='ch6'; faprep $ch-mmd.html > $ch-fa.txt; head -6 $ch-fa.txt
[b]Hoenn Saga (S1)[/b]
[i]Next City Arc (A2)[/i]
Chapter 6: A Peaceful City Stroll... Or Not
The next two weeks of training passed by too quickly and too slowly at the same time. [rest of paragraph omitted for space]
calyodelphi@dragonpad:~/pokemon-story/compilations $
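For anyone who wants to try the /<body>/,+6 address in isolation, here is a reduced sketch. It assumes GNU sed, since addr1,+N is a GNU extension, and it uses a made-up sample plus hypothetical variable names rather than my real chapter file. The range starts at the line matching <body> and covers the six lines after it, regardless of what the line numbers are.

```shell
# Hypothetical sample input, not the real chapter file.
sample='<html>
<head>
<meta charset="utf-8">
<title>Demo</title>
</head>
<body>
[b]Saga[/b]

[i]Arc[/i]

Chapter 1

First paragraph.

Second paragraph.
</body>
</html>'

# Assumes GNU sed: "/<body>/,+6" covers the <body> line plus the next 6 lines.
# The blank-line removal runs BEFORE the body tags are stripped.
result=$(printf '%s\n' "$sample" |
  sed -e '/<body>/,+6{/^$/d;}' \
      -e '/<body>/,/<\/body>/!d' \
      -e '/<body>/d' -e '/<\/body>/d')

printf '%s\n' "$result"
```

Because the range is anchored to the <body> line instead of to a fixed line number, it should keep working if the length of the header changes.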
Thanks, @Wintermute! :D