How do I convert text to title case?

Time: 2016-12-09 11:39:02

Tags: perl

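I have a text file containing a list of titles that I need to convert to title case (every word should start with a capital letter, except for most articles, conjunctions, and prepositions).

For example, this list of book titles: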

barbarians at the gate 
hot, flat, and crowded 
A DAY LATE AND A DOLLAR SHORT 
THE HITCHHIKER'S GUIDE TO THE GALAXY

should be changed to:

Barbarians at the Gate 
Hot, Flat, and Crowded 
A Day Late and a Dollar Short 
The Hitchhiker's Guide to the Galaxy

I wrote the following code:

while(<DATA>)
{
    $_=~s/(\s+)([a-z])/$1.uc($2)/eg;
    print $_;
}

But it capitalizes the first letter of every word, even words such as "at", "the", and "a" in the middle of a title:

Barbarians At The Gate 
Hot, Flat, And Crowded 
A Day Late And A Dollar Short 
The Hitchhiker's Guide To The Galaxy

How can I do this?

2 Answers:

Answer 0: (score: 4)

Thanks to Håkon Hægland for pointing me to Lingua::EN::Titlecase, which gives the desired output.

use strict;
use warnings;
use Lingua::EN::Titlecase;

while (my $line = <DATA>)
{
    chomp $line;
    # Create a title-caser for this line and print the converted title
    my $tc = Lingua::EN::Titlecase->new($line);
    print $tc->title, "\n";
}

Answer 1: (score: 0)

You can also try this regex: ^(.)(.*?)\b|\b(at|to|that|and|this|the|a|is|was)\b|\b(\w)([\w']*?(?:[^\w'-]|$)) and replace with \U$1\L$2$3\U$4\L$5. It works by capitalizing the first word unconditionally, lowercasing the listed connecting words, and, for every other word, capitalizing the first letter and lowercasing the rest. This seems to work in PHP; I don't know about Perl, but it will likely work.

  • ^(.)(.*?)\b matches the first letter of the first word (group 1) and the rest of that word (group 2). This ensures the first word is capitalized even when it is an article.
  • \b(word|multiple words|...)\b matches any connecting word (group 3) so that it is not capitalized.
  • \b(\w)([\w']*?(?:[^\w'-]|$)) matches the first letter of any other word (group 4) and the rest of the word (group 5). Here I used [^\w'-] instead of \b to end the word, so hyphens and apostrophes are counted as word characters too. This prevents 's from becoming 'S.

In the replacement, \U uppercases the characters that follow and \L lowercases them. To keep more articles or other words from being capitalized, add them to the alternation in the regex.
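
Since the answer was not tested in Perl, here is a minimal Perl sketch of the same three-way idea (first word, words to keep lowercase, everything else). It is not a port of the exact regex above: the helper name title_case, the word list in $small, and the group numbering are assumptions made for illustration, and the pattern treats apostrophes and hyphens as part of a word.

```perl
use strict;
use warnings;

# Assumption: this short list mirrors the words in the regex above; extend as needed.
my $small = qr/(?:at|to|that|and|this|the|a|is|was)/i;

# title_case() is a hypothetical helper name, not part of any module.
sub title_case {
    my ($title) = @_;
    $title =~ s{
        ^(\w)([\w'-]*)              # groups 1-2: first word, always capitalized
      | \b($small)\b(?![\w'-])      # group 3: a word to keep lowercase
      | \b(\w)([\w'-]*)             # groups 4-5: any other word
    }{
        defined $1 ? "\U$1\L$2"     # uppercase first letter, lowercase the rest
      : defined $3 ? "\L$3"
      :              "\U$4\L$5"
    }gex;
    return $title;
}

print title_case($_), "\n" for
    'barbarians at the gate',
    'A DAY LATE AND A DOLLAR SHORT',
    "THE HITCHHIKER'S GUIDE TO THE GALAXY";
```

Using the /e flag with defined checks avoids "uninitialized value" warnings for the groups that did not take part in a match. A title that starts with punctuation (for example an opening quote) would not hit the first alternative, so its first word could stay lowercase if it is in the list.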

UPDATE: I changed the regex so that you can include connecting phrases (multiple words) as well. That will still make for a very long regex, though...
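
One way to keep a long list of words and phrases manageable in Perl is to build the alternation from an array instead of typing it inline. The sketch below is only an illustration: the list contents, the phrase 'as well as', and the helper name lower_connectors are assumptions. It lowercases the listed words and phrases only, so capitalizing the other words (and the first word of the title) would still be a separate step.

```perl
use strict;
use warnings;

# Hypothetical word list; entries may be multi-word phrases.
my @keep_lower = ('as well as', 'at', 'to', 'that', 'and', 'this', 'the', 'a', 'is', 'was');

# Sort longest first so a phrase is tried before any shorter entry,
# and quotemeta each entry so it is matched literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @keep_lower;
my $keep_lower_re = qr/\b(?:$alternation)\b/i;

# lower_connectors() is a hypothetical helper: it only lowercases listed
# words and phrases and leaves everything else untouched.
sub lower_connectors {
    my ($title) = @_;
    $title =~ s/($keep_lower_re)/\L$1/g;
    return $title;
}

print lower_connectors('Salt As Well As Pepper'), "\n";
```

The list could just as well be read from a file, one entry per line, which keeps the regex itself out of the source code.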