Question

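I'm currently writing a program that changes certain words to their Shakespearean equivalents. I have to extract the sentences that contain those words and print them to another file. I also had to remove .START from the beginning of each file.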

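First I split the text of each file on spaces, so now I have the individual words. Next, I iterate over the words and look each one up in a hash. The hash keys and values come from a tab-delimited file with the structure OldEng / ModernEng (lc_Shakespeare_lexicon.txt). Right now I'm trying to figure out how to find the exact position of each modern English word, change it to the Shakespearean one, then find the sentences that contain changed words and print them to a different file. Most of the code is finished except for that last part. Here is my code so far: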

#!/usr/bin/perl -w
use diagnostics;
use strict;

#Declare variables
my $counter=();
my %hash=();
my $conv1=();
my $conv2=();
my $ssph=();
my @text=();
my $key=();
my $value=();
my $conversion=();
my @rmv=();
my $splits=();
my $words=();
my @word=();
my $vals=();
my $existingdir='/home/nelly/Desktop';
my @file='Sentences.txt'; 
my $eng_words=();
my $results=();
my $storage=();

#Open file to tab delimited words

open (FILE,"<", "lc_shakespeare_lexicon.txt") or die "could not open        lc_shakespeare_lexicon.txt\n";

#split words by tabs 

while (<FILE>){ 
    chomp($_);
    ($value, $key)= (split(/\t/), $_);
    $hash{$value}=$key; 
}   

#open directory to Shakespearean files

my $dir="/home/nelly/Desktop/input"; 
opendir(DIR,$dir) or die "can't opendir Shakespeare_input.tar.gz";
#Use grep to get WSJ file and store into an array

my @array= grep {/WSJ/} readdir(DIR);

#store file in a scalar
foreach my $file(@array){

    #open files inside of input

    open (DATA,"<", "/home/nelly/Desktop/input/$file") or die "could not open $file\n";
    #loop through each file

    while (<DATA>){
        @text=$_;
        chomp(@text);
    #Remove .START
    @rmv=grep(!/.START/, @text);

foreach $splits(@rmv){
    #split data into separate words
    @word=(split(/ /, $splits));
    #Loop through each word and replace with Shakespearean word that exists
    $counter=0;

foreach $words(@word){
        if (exists $hash{$words}){
            $eng_words= $hash{$words};
            $results=$counter;
            print "$counter\n";
            $counter++;

#create a new directory and store senteces with Shakespearean words in new file called "Sentences.txt"
mkdir $existingdir unless -d $existingdir; 
open my $FILE, ">>", "$existingdir/@file", or die "Can't open       $existingdir/conversion.txt'\n";
#print $FILE "@words\n";

close ($FILE);

                }           
            }
        }
    }   
}

close (FILE);
close (DIR);

Answer 1

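Except in trivial cases, natural language processing is very difficult to get right. For instance, it is hard to define exactly what is meant by a word or a sentence, and it is awkward to distinguish between a single quote and an apostrophe when both are represented by the U+0027 "apostrophe" character '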

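Without any example data it is difficult to write a reliable solution, but the program below should be reasonably close

Please note the following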

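use warnings is preferable to -w on the shebang line
A program should contain as few comments as possible, as long as it remains comprehensible. Too many comments just make the program bigger and harder to grasp without adding any new information. The choice of identifiers should make the code mostly self-documenting
I believe use diagnostics is unnecessary. Most messages are fairly self-explanatory, and diagnostics can produce large amounts of unnecessary output
Because you are opening multiple files, it is more concise to use autodie, which avoids the need to explicitly test every open call for success
It is better to use lexical file handles, such as open my $fh ..., rather than global ones, such as open FH .... For one thing, a lexical file handle is implicitly closed when it goes out of scope, which tidies up the program a lot by making explicit close calls unnecessary
I have removed all of the variable declarations from the top of the program except those that are non-empty, and instead declare each variable where it is first used. This is considered best practice, as it aids debugging and helps in writing clean code
The program lowercases the original word using lc before checking whether there is a matching entry in the hash. If a translation is found, the new word is capitalised using ucfirst if the original word began with a capital letter
I have written a regex that takes the next sentence from the start of the string $contents. But this is one of the things that I can't get right without sample data, and there may well be problems with, for instance, sentences that end with a closing quotation mark or a closing parenthesis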
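
To make the last point concrete, here is a minimal sketch that wraps the same sentence regex in a small sub and runs it on a made-up sample string. The sub name split_sentences and the sample text are assumptions for illustration only; they are not part of the program that follows.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sentences found in $text.
# A sentence is taken to end at '.', '?' or '!' when that character is
# followed by whitespace and a capital letter, or by the end of the text.
sub split_sentences {
    my ($text) = @_;
    my @sentences;
    while ( $text =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx ) {
        push @sentences, $1;
    }
    return @sentences;
}

# Made-up sample text, for illustration only
my $sample = 'You are late. Where have you been? "I was busy." He left.';
print "$_\n" for split_sentences($sample);
```

With this sample the first sentence is found correctly, but everything from "Where" to the end comes back as a single sentence: the question mark is followed by a quotation mark rather than a capital letter, and the full stop inside the quotes is followed by the closing quote rather than whitespace. That is exactly the kind of case that needs real sample data to tune.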

use strict;
use warnings;
use autodie;

my $lexicon      = 'lc_shakespeare_lexicon.txt';
my $dir          = '/home/nelly/Desktop/input';
my $existing_dir = '/home/nelly/Desktop';
my $sentences    = 'Sentences.txt';

my %lexicon = do {
  open my ($fh), '<', $lexicon;
  local $/;
  reverse(<$fh> =~ /[^\t\n\r]+/g);
};

my @files = do {
  opendir my ($dh), $dir;
  grep /WSJ/, readdir $dh;
};

for my $file (@files) {

  my $contents = do {
    open my $fh, '<', "$dir/$file";
    join '', grep { not /\A\.START/ } <$fh>;
  };

  # Change any CR or LF to a space, and reduce multiple spaces to single spaces
  $contents =~ tr/\r\n/  /;
  $contents =~ s/ {2,}/ /g;

  # Find and process each sentence
  while ( $contents =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx ) {
    my $sentence = $1;
    my @words    = split ' ', $sentence;
    my $changed;

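    for my $word (@words) {
      my $new_word = $lexicon{lc $word};
      # No translation for this word: leave it alone (also avoids
      # calling ucfirst on an undefined value)
      next unless defined $new_word;
      $new_word = ucfirst $new_word if $word =~ /\A[A-Z]/;
      $word = $new_word;
      ++$changed;
    }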

    if ($changed) {
      mkdir $existing_dir unless -d $existing_dir;
      open my $out_fh, '>>', "$existing_dir/$sentences";
      # Write to the output file rather than to STDOUT
      print $out_fh "@words\n";
    }
  }
}
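
The question also asks how to find the position of each changed word. One possible approach, sketched here under assumptions, is to loop over the word indices instead of keeping a separate counter and to record every index where a replacement happens. The sub name replace_with_positions and the two lexicon entries are made up for illustration. The sketch also separates leading and trailing punctuation from each word before the lookup, which the program above does not do.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the program above.
# Takes a sentence and a hash ref mapping lowercase modern words to
# Shakespearean words. Returns the rewritten sentence and an array ref
# holding the 0-based positions of the words that were changed.
sub replace_with_positions {
    my ($sentence, $lexicon) = @_;
    my @words = split ' ', $sentence;
    my @positions;

    for my $i ( 0 .. $#words ) {
        # Separate leading and trailing punctuation from the word itself
        my ($lead, $core, $trail) = $words[$i] =~ /\A(\W*)(.*?)(\W*)\z/;
        my $new = $lexicon->{ lc $core };
        next unless defined $new;

        # Keep the capitalisation of the original word
        $new = ucfirst $new if $core =~ /\A[A-Z]/;
        $words[$i] = $lead . $new . $trail;
        push @positions, $i;
    }

    return ( "@words", \@positions );
}

# Made-up lexicon entries, for illustration only
my %sample = ( you => 'thou', your => 'thy' );
my ($text, $where) = replace_with_positions( 'You forgot your hat, did you?', \%sample );
print "$text\n";
print "changed at: @$where\n";
```

The loop index takes the place of $counter in the question's code: it advances for every word, not only for matching ones, so the recorded values are true positions within the sentence. A sentence is worth writing to the output file whenever the returned position list is non-empty.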

How do I use a counter to find the position of a word?
