Question

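I have a script that reads a CSV file line by line and compares the title in field 2 against the titles in another CSV file. If five or more words match, it prints the line from each file that meets this condition. Here is the script: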

#!/bin/perl

#subroutine for discovering year

sub find_year {
    my( $str ) = @_;
    # Capture in list context; "my $x = ... if ..." has unreliable behaviour
    my ($year) = $str =~ /\b((?:19|20)\d\d)\b/;
    return $year;
}

#####CREATE CSV2 DATA

my @csv2 = ();

open CSV2, "<csv2" or die;
@csv2=<CSV2>;
close CSV2;

my %csv2hash = ();
my @csv2years;

for ( @csv2 ) {
    chomp;
    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/define the data which is the title
    $csv2hash{$_} = $title; # Indicate that title data will input into csv2hash.
}

###### CREATE CSV1 DATA

open CSV1, "<csv1" or die;

while (<CSV1>) {
    chomp;      #removes new lines

    my ($title) = $_ =~ /^.+?,\s*([^,]+?),/; #/ creates variable of title
    my %words;

    $words{$_}++ for split /\s+/, $title;    #/ get words

    ## Collect unique words into an array- the @ means an array

    my @titlewords = keys(%words);

    # Add exception words which shouldn't be matched.

    my @new;
    foreach my $t (@titlewords){
        push(@new, $t) if $t !~ /^(rare|vol|volume|issue|double|magazine|mag)$/i;
    }


    ###### The comparison algorithm

    @titlewords = @new;

    my $desired = 5;      # Desired matching number of words
    my $matched = 0;

    foreach my $csv2 (keys %csv2hash) {
        my $count = 0;
        my $value = $csv2hash{$csv2};

        foreach my $word (@titlewords) {
            my @matches   = ( $value=~/\b$word\b/ig );
            my $numIncsv2 = scalar(@matches);

            @matches      = ( $title=~/\b$word\b/ig );

            my $numIncsv1 = scalar(@matches);

            ++$count if $value =~ /\b$word\b/i;

            if ($count >= $desired || ($numIncsv1 >= $desired && $numIncsv2 >= $desired)) {
                $count = $desired+1;
                last;
            }
        }

        if ($count >= $desired) {
            print "$csv2\n";
            ++$matched;
        }
    }
    print "$_\n\n" if $matched;
}

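As you can see, I have created a find_year subroutine that can be used to detect whether a title contains a 20th- or 21st-century year (19xx or 20xx). A few days ago I asked a question about assigning an outcome to a set of conditions involving year matches, and Borodin provided an excellent answer here: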

Perl- What function am I looking for? Assigning multiple rules to a specified outcome

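I now want to apply the same conditions, except that this time the script will compare the years in the CSV titles rather than standard input against a data list (as in the previous question).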

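What I want to do now is include that logic as a function in my word-matching script, so that if the conditions from my previous question evaluate to Pass, the word-matching part of the script (the five-word match) runs. If they evaluate to Fail, the script should skip the comparison and move on to the next line, without bothering with the five-word matching step. The Pass and Fail results do not need to be printed; I am only using those words to describe the rules for the year-comparison conditions in my previous question.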
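A minimal sketch of such a gate, assuming the rule implied by the sample results in this question: a pair passes when either title has no year, or when both titles carry the same year. The name years_compatible is hypothetical, and the actual conditions from the previous question may be stricter than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first 19xx/20xx year found in a string, or undef if there is none.
sub find_year {
    my ($str) = @_;
    my ($year) = $str =~ /\b((?:19|20)\d\d)\b/;
    return $year;
}

# Hypothetical helper (assumed rule, not taken from the previous question):
# pass when either title lacks a year, or when both years are equal.
sub years_compatible {
    my ($title1, $title2) = @_;
    my $y1 = find_year($title1);
    my $y2 = find_year($title2);
    return 1 if !defined $y1 || !defined $y2;
    return $y1 == $y2 ? 1 : 0;
}

for my $pair (
    [ 'Mullholland Drive 1959',           'Mullholland Drive 1939' ],
    [ 'The Twin Peaks forget that stuff', 'Twin Peaks forget that stuff 1984' ],
    [ '1939 hello good world',            '1939 good lord' ],
) {
    my ($t1, $t2) = @$pair;
    my $verdict = years_compatible($t1, $t2) ? 'pass' : 'skip';
    print "$verdict: $t1 | $t2\n";
}
```

Under that assumed rule the first pair is skipped (1959 vs 1939) and the other two pass. In the word-matching script, a call like this would sit at the top of the inner loop over csv2 titles, followed by next when it returns false.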

Example of csv1:

14564564,1987 the door to the other doors,546456,47878787
456456445,Mullholland Drive is the best film ever 1959,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545
454654564, 1939 hello good world you are great ,45456456, 54564654

Example of csv2:

154465454,the other door was the door to 1949,546456,478787870
2156485754,Mullholland Drive is the best film ever 1939,45454545,45454545
87894454,Twin Peaks forget that stuff 1984,45454545,45454545
2145678787, 1939 good lord you are great ,787425454,878777874

Current results, before adding the year-matching subroutine:

2156485754,Mullholland Drive is the best film ever 1939,45454545,45454545
456456445,Mullholland Drive is the best film ever 1959,45454545,45454545

87894454,Twin Peaks forget that stuff 1984,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545

2145678787, 1939 good lord you are great ,787425454,878777874
454654564, 1939 hello good world you are great ,45456456, 54564654

Desired results, after adding the year-matching subroutine:

87894454,Twin Peaks forget that stuff 1984,45454545,45454545
456456445,The Twin Peaks forget that stuff,45454545,45454545

2145678787, 1939 good lord you are great ,787425454,878777874
454654564, 1939 hello good world you are great ,45456456, 54564654

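I can follow Borodin's answer to the previous question, but the script I am working on is hard to read (in my noob opinion, anyway!), and I am having trouble working out how to fit this new functionality into it.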

Answer 1

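I revised the algorithm. The repeated loops over csv2 are replaced with a hash that maps each word to the csv2 row numbers containing it. A separate preliminary year check is no longer needed, because the year comparison is part of the final condition.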

#!/usr/bin/perl
#use Data::Dumper;
#####CREATE CSV2 DATA
open CSV2, "<csv2" or die;
my @csv2=<CSV2>;
close CSV2;
my %words2; # $words2{lower_case_word}->{csv2_row_number}->word_count
my $i=0; # csv2 row number
my %c2year; # Years of csv2 row numbers
for(@csv2) {
   chomp;
   for((split /,\s*/)[1]=~/(\w+)/g) { # list words in title
    $words2{lc($_)}{$i}++;
    $c2year{$i}=$_ if(/^(19|20)\d\d$/);
   }
   $i++;
}
#print Dumper(\%words2);

###### READ CSV1 DATA
my $desired = 5;      # Desired matching number of words

open CSV1, "<csv1" or die;
while (<CSV1>) {
   chomp;       #removes new lines
   my %rows=(); # $rows{csv2_row_number} => number_of_matched_words_in_row
   my $matched = 0;
   my ($title) = (split /,\s*/)[1]; #/ creates variable of title
   my %words;
   my $year=0;
####### get words and filter it
   $words{lc($_)}++ for
       grep {
         $year=$_ if(/^(19|20)\d\d$/); # Years present in word list
         !/^(rare|vol|volume|issue|double|magazine|mag)$/i
       } $title=~/(\w+)/g; #/
###### The comparison algorithm
   for(keys(%words)) {
    # my $word=$_; # <-- if need count words
    if($words2{$_}) {
     for(keys(%{$words2{$_}})) {
      $rows{$_}++; # <-- OR $rows{$_}+=$words{$word} OR/AND +=$words2{$word}{$_}
     }
    }
   }
#    print Dumper(\%rows);
   for(keys(%rows)) {
      if ( ($rows{$_} >= $desired)
          && (!$year || !$c2year{$_} || $year==$c2year{$_} )
         ) {
        print "$year<=>$c2year{$_} csv2: ",$csv2[$_],"\n";
        ++$matched;
      }
   }
   print "csv1: $_\n\n" if $matched;
}

Uncomment use Data::Dumper and the print Dumper(...) lines to inspect the hashes.
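To see what the word index does in isolation, here is a small sketch that builds the same word => { row => count } structure from two in-memory titles instead of a file. The helper names build_index and rows_sharing_words are made up for illustration, and the stop-word filter and year check from the script are left out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample titles standing in for field 2 of csv2 (illustrative data only).
my @titles2 = (
    'the other door was the door to 1949',
    'Twin Peaks forget that stuff 1984',
);

# Build word => { row_number => count }, the same shape as %words2 above.
sub build_index {
    my @titles = @_;
    my %index;
    for my $row (0 .. $#titles) {
        $index{ lc($_) }{$row}++ for $titles[$row] =~ /(\w+)/g;
    }
    return \%index;
}

# For each indexed row, count how many distinct words of $title occur in it.
sub rows_sharing_words {
    my ($index, $title) = @_;
    my %seen;
    $seen{ lc($_) } = 1 for $title =~ /(\w+)/g;
    my %rows;
    for my $word (keys %seen) {
        next unless $index->{$word};
        $rows{$_}++ for keys %{ $index->{$word} };
    }
    return \%rows;
}

my $index = build_index(@titles2);
my $rows  = rows_sharing_words($index, 'The Twin Peaks forget that stuff');
print "row $_ shares $rows->{$_} distinct word(s)\n"
    for sort { $a <=> $b } keys %$rows;
```

Each csv1 title only touches the rows that contain at least one of its words, so there is no need to run a regex against every csv2 title for every word.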

If repeated occurrences of the same word need to be counted, use this instead:

###### The comparison algorithm
   for(keys(%words)) {
    my $W=$_;
    if($words2{$_}) {
     for(keys(%{$words2{$_}})) {
      $rows{$_} += $words{$W} < $words2{$W}{$_} ? $words{$W} : $words2{$W}{$_};
      # $words{$W} - same word count in csv1, $words2{$W}{$_} - count in csv2
     }
    }
   }
#    print Dumper(\%rows);
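The difference between the two counting modes is easiest to see on one pair of titles. The sketch below is a standalone illustration with hypothetical helper names (word_counts, distinct_overlap, multiset_overlap); it compares counting each shared word once against adding the smaller of the two occurrence counts, which is what the variant above does.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count occurrences of each lower-cased word in a title.
sub word_counts {
    my ($title) = @_;
    my %count;
    $count{ lc($_) }++ for $title =~ /(\w+)/g;
    return \%count;
}

# Each shared word counts once, as in the first version of the algorithm.
sub distinct_overlap {
    my ($c1, $c2) = @_;
    return scalar grep { exists $c2->{$_} } keys %$c1;
}

# Each shared word contributes the smaller of its two occurrence counts.
sub multiset_overlap {
    my ($c1, $c2) = @_;
    my $total = 0;
    for my $word (keys %$c1) {
        next unless exists $c2->{$word};
        $total += $c1->{$word} < $c2->{$word} ? $c1->{$word} : $c2->{$word};
    }
    return $total;
}

my $c1 = word_counts('1987 the door to the other doors');
my $c2 = word_counts('the other door was the door to 1949');

print "distinct shared words: ", distinct_overlap($c1, $c2), "\n";
print "with repeats counted:  ", multiset_overlap($c1, $c2), "\n";
```

For this pair the shared words are "the", "door", "to" and "other", so the distinct count is 4, while counting repeats gives 5 because "the" appears twice in both titles. Which mode is appropriate depends on whether repeated common words should help a pair reach the threshold.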
