Script for fixing broken lines in a .txt file?

Time: 2010-06-09 14:48:11

Tags: perl text-processing

I love to read books properly on my Kindle.

To achieve my dream, I need a script to fix broken lines in a txt file.

For example, if the txt file contains these lines:

He watched Kahlan as she walked with her shoulders slumped
down.

...then it should fix them by removing the newline before the word "down":

He watched Kahlan as she walked with her shoulders slumped down.

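So, fellow programmers, what is (a) the easiest way and (b) the best language to do this in?

P.S. The solution will involve searching for a lower-case letter in column 1 and removing the newline before it, so that the two lines are spliced together. In the novel I'm trying to fix, there are 1.2 million occurrences of this "rogue break".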
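The rule described in the P.S. can be sketched in a few lines of Python. This is only an illustration under the stated assumption that every rogue break is followed by a lower-case ASCII letter; the function name join_lowercase_continuations is made up for the example. Continuation lines that start with a capital letter (a name such as Kahlan, for instance) are left alone by this rule.

```python
import re


def join_lowercase_continuations(text: str) -> str:
    """Replace each newline that is directly followed by a lower-case
    ASCII letter with a single space."""
    return re.sub(r"\n(?=[a-z])", " ", text)


sample = "He watched Kahlan as she walked with her shoulders slumped\ndown.\n"
print(join_lowercase_continuations(sample), end="")
# prints: He watched Kahlan as she walked with her shoulders slumped down.
```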

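7 answers:

Answer 0 (score: 2)

There are many ways to do this. I'd recommend something in Perl, Python, or Ruby. If you want a quick-and-dirty one-liner, Perl has the edge in that department.

For example, this does what you asked for: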

# Slurp entire file.
# Convert newlines followed by lower-case letter.
perl -p -e 'BEGIN {$/ = undef}    s/\n(?=[a-z])/ /g' book.txt

But if paragraphs are separated by 2 newlines, this might work better.

# Process file a "paragraph" at a time.
# Convert newlines followed by at least 2 characters.
perl -p -e 'BEGIN {$/ = qq{\n\n}} s/\n(?=..)/ /g'    book.txt
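For comparison, the paragraph-at-a-time idea can be sketched in Python as well. This is not a translation of the one-liner above: it assumes paragraphs are separated by one or more blank lines, it joins every line inside a paragraph regardless of how the next line starts, and it normalizes each paragraph break to exactly one blank line. The name unwrap_paragraphs is made up for the example.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join the lines inside each paragraph with single spaces and keep
    one blank line between paragraphs."""
    # Two or more consecutive newlines are treated as a paragraph break.
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    joined = [
        " ".join(line.strip() for line in p.splitlines() if line.strip())
        for p in paragraphs
    ]
    return "\n\n".join(joined) + "\n"


sample = "He watched Kahlan as she walked\nwith her shoulders slumped\ndown.\n\nShe left\nearly.\n"
print(unwrap_paragraphs(sample), end="")
```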

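Answer 1 (score: 1)

If there are blank lines between paragraphs: read the text a paragraph at a time (set $/ = "\n\n"), then use Text::Autoformat from CPAN.

Example (substitute a regular filehandle for DATA; I only used DATA for convenience in the example):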

use strict;
use warnings;
use Text::Autoformat;

local $/ = "\n\n";
while (<DATA>) {
    print autoformat $_, {left=>1, right=>80};
}


__DATA__
He watched Kahlan as she walked with her shoulders slumped 
down. 

He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 

He watched Kahlan as she walked with her shoulders slumped 
down. 
He watched Kahlan as she walked with her shoulders slumped 
down. 

Output:

He watched Kahlan as she walked with her shoulders slumped down.

He watched Kahlan as she walked with her shoulders slumped down. He watched
Kahlan as she walked with her shoulders slumped down. He watched Kahlan as she
walked with her shoulders slumped down.

He watched Kahlan as she walked with her shoulders slumped down. He watched
Kahlan as she walked with her shoulders slumped down.
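A rough analogue of this re-wrapping step can be written with Python's standard textwrap module. It is a sketch of the same idea (split on blank lines, then re-fill each paragraph to a fixed width), not a port of Text::Autoformat; the function name reflow and the default width of 80 (chosen to mirror right=>80 above) are assumptions made for the example.

```python
import re
import textwrap


def reflow(text: str, width: int = 80) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = []
    for p in paragraphs:
        # Collapse all whitespace (including old line breaks and trailing
        # spaces) before filling, so no double spaces survive.
        wrapped.append(textwrap.fill(" ".join(p.split()), width=width))
    return "\n\n".join(wrapped) + "\n"


sample = "He watched Kahlan as she walked with her shoulders slumped \ndown. \n"
print(reflow(sample), end="")
# prints: He watched Kahlan as she walked with her shoulders slumped down.
```

For an e-reader that wraps text itself, a very large width would give one line per paragraph; that choice is left to the caller.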

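Answer 2 (score: 0)

If there are blank lines between paragraphs, you can open the file in a good text editor that has an "unwrap text" option. One such editor is TextMate for the Mac, but there are probably options for Windows as well.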

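Answer 3 (score: 0)

I'd say parse the book and look for occurrences of newlines. If a newline does not come right after a period, remove it. The only problem is that this won't work in this particular case: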

  

He watched Kahlan as she walked with her shoulders slumped down.\n

He watched Kahlan as she walked with her shoulders slumped down.

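as opposed to:

He watched Kahlan as she walked with her shoulders slumped down. He watched Kahlan as she walked with her shoulders slumped down.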

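In that case, you have to determine how paragraphs are separated (is it by two newlines?). If so, check whether a period is followed by two newlines. If it isn't, remove the single newline that follows it.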
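The two checks described in this answer can be sketched in Python as follows. The function name join_wrapped_lines and the blank_line_paragraphs flag are made up for the illustration, and the sketch assumes that only a period ends a sentence. With the flag off, a newline after a period is kept because it might be a paragraph break; with the flag on, only two newlines in a row count as a paragraph break.

```python
import re


def join_wrapped_lines(text: str, blank_line_paragraphs: bool = True) -> str:
    # Step 1: a lone newline that does not follow a period is a wrap; join it.
    text = re.sub(r"(?<![.\n])\n(?=[^\n])", " ", text)
    if blank_line_paragraphs:
        # Step 2: a lone newline after a period is also a wrap, because
        # real paragraph breaks are known to be two newlines.
        text = re.sub(r"(?<=\.)\n(?=[^\n])", " ", text)
    return text


sample = "He watched\nKahlan.\nShe walked\non.\n\nNext paragraph\nhere.\n"
print(join_wrapped_lines(sample), end="")
```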

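Answer 4 (score: 0)

Using a regular expression to match a lower-case character that immediately follows a newline, and then replacing that newline with a space, should do the trick.

Here is a C# implementation: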

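    using System;
    using System.Text.RegularExpressions;

    // Replaces every line break that is followed by a lower-case letter
    // with a single space; the letter itself is kept.
    string UnwrapText(string input)
    {
        return Regex.Replace(input, Environment.NewLine + "[a-z]",
                            delegate(Match m)
                            {
                                return m.ToString().Replace(Environment.NewLine, " ");
                            });
    }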

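Answer 5 (score: 0)

If paragraphs begin with a tab, the most effective approach is probably to remove every newline that is not followed by a tab and replace it with a space.

If not, you can nuke every newline that is not part of a run of 2 or more newlines.

You could also look at all newlines that do not follow a period, but as mentioned above, that fails when a sentence ends a line without ending a paragraph.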

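Answer 6 (score: 0)

Open the file in vim, run :set tw=0 noai, then type gggqG. If the file is reasonably well behaved, that should remove all the newlines within paragraphs while preserving the paragraph breaks.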