Case sensitivity in a Perl script - how do I make it insensitive?

Asked: 2010-03-02 21:05:01

Tags: perl text scripting lowercase capitalization



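As it stands, if you feed the script 99 lowercase sentences and 1 capitalized sentence, you almost always find an un-Markovized (verbatim) copy of the capitalized sentence in the output.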

# Copyright (C) 1999 Lucent Technologies
# Excerpted from 'The Practice of Programming'
# by Brian W. Kernighan and Rob Pike

# markov chain algorithm for 2-word prefixes

$MAXGEN = 10000;
$NONWORD = "\n";
$w1 = $w2 = $NONWORD;                    # initial state
while (<>)
{                                        # read each line of input
    foreach (split)
    {
        push(@{$statetab{$w1}{$w2}}, $_);
        ($w1, $w2) = ($w2, $_);          # multiple assignment
    }
}
push(@{$statetab{$w1}{$w2}}, $NONWORD);  # add tail
$w1 = $w2 = $NONWORD;

for ($i = 0; $i < $MAXGEN; $i++)
{
    $suf = $statetab{$w1}{$w2};      # array reference
    $r = int(rand @$suf);            # @$suf is number of elems
    exit if (($t = $suf->[$r]) eq $NONWORD);
    print "$t\n";
    ($w1, $w2) = ($w2, $t);          # advance chain
}

3 Answers:

Answer 0 (score: 6)

Nathan Fellman and mobrule are both suggesting a common practice: normalization.





The code that generates the multi-level state table was the most interesting bit. I could have used Data::Diver, but I wanted to work it out myself.

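The word normalization code really should allow the normalizer to return a list of words to process, rather than just a single word. (I originally did not feel like fixing that, but it can now return a list.) Other things, like serializing the processed corpus and using Getopt::Long for command-line switches, remain to be done. I only did the fun bits.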

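Writing this without objects was a bit of a challenge for me; this really is a good place to make a Markov generator object. I like objects. But I decided to keep the code procedural so that it retains the spirit of the original.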


use strict;
use warnings;

use IO::Handle;

use constant NONWORD => "-";
my $MAXGEN = 10000;
my $DEPTH  = 2;

my %state_table;

process_corpus( \*ARGV, $DEPTH, \%state_table );
generate_markov_chain( \%state_table, $MAXGEN );

sub process_corpus {
    my $fh    = shift;
    my $depth = shift;
    my $state_table = shift || {};

    # Start with a history made up entirely of NONWORD markers.
    my @history = (NONWORD) x $depth;

    while( my $raw_line = $fh->getline ) {

        my $line = normalize_line($raw_line);
        next unless defined $line;

        my @words = map normalize_word($_), split /\s+/, $line;
        for my $word ( @words ) {

            next unless defined $word;

            add_word_to_table( $state_table, \@history, $word );
            push  @history, $word;
            shift @history;
        }
    }

    # Mark the end of the corpus.
    add_word_to_table( $state_table, \@history, NONWORD );

    return $state_table;
}

# This was the trickiest to write.
# $node has to be a reference to the slot so that 
# autovivified items will be retained in the $table.
sub add_word_to_table {
    my $table   = shift;
    my $history = shift;
    my $word    = shift;

    my $node = \$table;

    # Walk down one hash level per history word.
    for( @$history ) {
        $node = \${$node}->{$_};
    }

    push @$$node, $word;

    return 1;
}

# Replace this with anything.
# Return an empty list to skip a word
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z]//g;
    return length $word ? $word : ();
}

# Replace this with anything.
# Return undef to skip a line
sub normalize_line {
    return uc shift;
}

sub generate_markov_chain {
    my $table   = shift;
    my $length  = shift;
    my $history = shift || [];

    my $node = $table;

    # With no starting history, seed it with one NONWORD per table level.
    unless( @$history ) {

        while(
            ref $node eq ref {}
                and
            exists $node->{NONWORD()}
        ) {
            $node = $node->{NONWORD()};
            push @$history, NONWORD;
        }
    }

    for (my $i = 0; $i < $length; $i++) {

        my $word = get_word( $table, $history );

        last if $word eq NONWORD;
        print "$word\n";

        push @$history, $word;
        shift @$history;
    }

    return $history;
}

sub get_word {
    my $table   = shift;
    my $history = shift;

    # Descend to the suffix list for the current history.
    for my $step ( @$history ) {
        $table = $table->{$step};
    }

    my $word = $table->[ int rand @$table ];
    return $word;
}
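
The "reference to the slot" trick in add_word_to_table() can be easier to follow in isolation. The following is a minimal sketch, not part of the answer's code: push_at_path() and the sample words are hypothetical names chosen for illustration. It walks a list of keys while holding a reference to the current hash slot, so every level that autovivifies stays attached to the original table.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): append $value to the array
# stored under the nested hash path @$keys, creating levels as needed.
sub push_at_path {
    my ( $table, $keys, $value ) = @_;

    my $slot = \$table;                  # reference to the scalar holding the hashref
    for my $key (@$keys) {
        $slot = \${$slot}->{$key};       # reference to the next slot, not a copy of its value
    }
    push @$$slot, $value;                # autovivifies the array in the final slot

    return $table;
}

my %table;
push_at_path( \%table, [ 'THE', 'QUICK' ], 'BROWN' );
push_at_path( \%table, [ 'THE', 'QUICK' ], 'RED' );
push_at_path( \%table, [ 'THE', 'LAZY' ],  'DOG' );

print join( ' ', @{ $table{THE}{QUICK} } ), "\n";    # BROWN RED
print join( ' ', sort keys %{ $table{THE} } ), "\n"; # LAZY QUICK
```

Copying the value instead (`$node = $node->{$_}`) would hand back undef for a missing key, and the hash created on the next step would belong to the local variable rather than to the table.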

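Update: I fixed the code above to handle multiple words returned from the normalize_word() routine. For example, the replacement routines below leave each line untouched and split punctuation marks off as separate words: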


sub normalize_line {
    return shift;
}

sub normalize_word {
    my $word = shift;

    # Sanitize words to only include letters and ?,.! marks
    $word =~ s/[^A-Z?.,!]//gi;

    # Break the word into multiple words as needed.
    my @words = split /([.?,!])/, $word;

    # return all non-zero length words.
    return grep length, @words;
}
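
To see what this normalizer hands back, here is a small sketch that runs a condensed copy of it on a few made-up inputs (the sample strings are assumptions, not from the answer). Because the split pattern uses a capture group, each punctuation mark is returned as a word of its own.

```perl
use strict;
use warnings;

# Condensed copy of the punctuation-splitting normalizer, so this runs on its own.
sub normalize_word {
    my $word = shift;
    $word =~ s/[^A-Z?.,!]//gi;                      # keep letters and ?,.! only
    return grep length, split /([.?,!])/, $word;    # drop empty pieces
}

for my $raw ( 'Hello,world!', "don't", '42' ) {
    my @words = normalize_word($raw);
    print "$raw => [", join( '|', @words ), "]\n";
}
# Hello,world! => [Hello|,|world|!]
# don't => [dont]
# 42 => []
```

An input with no letters or punctuation yields an empty list, which the map in process_corpus() simply drops.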


Answer 1 (score: 5)


See the lc function.
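
As a brief illustration of how lc behaves (the variable names and sample words here are made up): it returns a lowercased copy and leaves its argument alone, so the result has to be stored somewhere.

```perl
use strict;
use warnings;

my $word  = 'Hello';
my $lower = lc $word;          # lc returns a lowercased copy

print "$word $lower\n";        # Hello hello (the original is unchanged)

# To lowercase in place, assign the result back.
my @words = ( 'The', 'CAT' );
$_ = lc($_) for @words;        # $_ aliases each element of @words
print "@words\n";              # the cat
```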

Answer 2 (score: 4)


while (<>)
{                                        # read each line of input
    $_ = lc; # convert $_ to lowercase
    # etc.
}
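
The effect on the state table can be sketched as follows. This is an illustrative variant of the question's table-building loop, not the original script: build_table() is a hypothetical wrapper that takes lines as arguments instead of reading from <>, and the two sample sentences are made up. After lowercasing, "The cat" and "the cat" land in the same state, so their suffixes are pooled.

```perl
use strict;
use warnings;

my $NONWORD = "\n";

# Hypothetical wrapper: build the 2-word-prefix table from a list of
# lines, lowercasing each line before splitting it into words.
sub build_table {
    my @lines = @_;
    my %statetab;
    my ( $w1, $w2 ) = ( $NONWORD, $NONWORD );

    for my $line (@lines) {
        for my $word ( split ' ', lc $line ) {
            push @{ $statetab{$w1}{$w2} }, $word;
            ( $w1, $w2 ) = ( $w2, $word );
        }
    }
    push @{ $statetab{$w1}{$w2} }, $NONWORD;    # add tail

    return \%statetab;
}

my $tab = build_table( "The cat sat\n", "the cat ran\n" );
print join( ' ', @{ $tab->{the}{cat} } ), "\n";    # sat ran
```

Without the lc call, "The cat" and "the cat" would be separate prefixes with one suffix each, which is why a lone capitalized sentence tends to be reproduced verbatim.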