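Trying to match part of a file name plus additional text within the file's contents

Time: 2014-11-12 03:49:52

Tags: regex perl match

Hello, I am trying to match part of a file's name, plus some additional text, against the text inside that file.

Basically I have files like this: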

PieceIwanttomatch_don't_care_about_this.txt

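I am trying to match, say, the first seven letters of the file name plus a string inside the file, and I am having no luck.

Here is what I have so far: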

use strict;
use warnings;

use File::Path qw(make_path remove_tree);

my $calls_dir = "Ask/Parsed/Html/";
opendir(my $search_dir, $calls_dir) or die "$!\n";
my @files = grep /\.txt$/i, readdir $search_dir;
closedir $search_dir;

#print "Got ", scalar @files, " files\n";

#my %seen = ();
for my $file (@files) {

  my %seen         = ();
  my $current_file = $calls_dir . $file;
  open my $FILE, '<', $current_file or die "$file: $!\n";

  while (<$FILE>) {

    #if (/phone/i) {
    chomp;

    #if (/phone\s*(.*)\r?$/i) {
    #if (/^phone\s*:\s*(.*)\r?$/i) {
    #if (/Contact\s*(.*)\r?$/i) {
    #if (/^*(.*)team\s*(.*)\r?$/i) {

    print substr(${file}, 0, 7);

    if (/^(?=.* 'substr(${file}, 0, 7)')(?=.*management)/s) {

      $seen{$1} = 1;

      #print $file."\t"."$_\n";
      #open my $fh, '>', "Ask/Parsed/Html2/"."${file}.parsed_for_contact_us.txt" or die $!;

      make_path('Ask/Parsed/Html2/');
      open my $fh, '>', "Ask/Parsed/Html2/" . "${file}.parsed_for_management.txt" or die $!;
      #open my $fh, '>', "$_"."result".".txt" or die $!;

      #$fh->print("$file\t$_\n");
      $fh->print("$_\n");
      print "$_\n";

      #print "\t";
      print "\n";
      print "\t";

      #print "$_\n";
      #print "\t";
      #print "\n";

      foreach my $addr (sort keys %seen) {

      }
    }
  }

  close $FILE;
}

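Here is another example for people to look at:

An example of what I am trying to do: say my file is named nintendo_ask_parse.html. I am trying to use the string nintendo from the file name and another string (say, game) to find a line in the file and print it to another file.

Added 12/12/2014: At the request of some of the people who have been helping me so far, here is more data. I am running the first script I wrote, which pulls URLs into files. Here is the script: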

 use strict;
 use warnings;
 use LWP::Simple;

 my $link1 = "http://www.ask.com/web?q=";
 my $link2 = "+video+game&qsrc=0&o=0&l=dir&qo=homepageSearchBox";
 #my $link3 = "http://www.";
 #my $link4 = "http://www.manta.com/search?          search_source=nav&pt=&search_location=Burlingame+CA&search=";

 open (my $fh2, "untitled.txt")
 or die "Could not open file";

 while (my $row = <$fh2>) {
 chomp $row;
 print "$row\n";
 my $xml1 = $link1 . $row. $link2 ;
 #my $xmla = $link3 . $row . ".com";
 #my $xmlx = $link4 . $row;
 mkdir 'Ask', 0755;
 my $filename1 = "Ask/".($row)."_"."ask".".html";
 open my $fh1, ">", $filename1 or die("Could not open file. $!");

 print $row;
 my $xml2 = get $xml1;
 print $xml1;
 print "\n";
 print $fh1 $xml2;


 }

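=============================================== ==============================

After running this script I get HTML files according to the number of entries in the untitled.txt file, one per entry.

I have four sample files, produced by running the script above, named Activision_ask.html, Apple_ask.html, Atari_ask.html and Nintendo_ask.html. Here is the content of one file, Activision_ask.html:
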
     Answers
     Q&A Community
     Advanced Search


     Everything
     Images
     News
     First Video Game Invented
     Video Game Design
     Wii
     Video Game Designer Career
     Video Game Companies
     Spider-man 3 Video Game
     Video Game Walkthroughs
     Video Game Statistics
     Call of Duty 4
     More Answers
     Amazon.com results for activision


     Source
     Activision Publishing, Inc. is an American video game publisher. It was founded on October 1,      1979 and was the world's first independent developer and distributor of video games for gaming   consoles. Its first products were cartridges for the Atari 2600 video console system published from July 1980 for the US market and from August 1981 for the international market (UK). Activision is now one of the largest video game publishers in the world and was also the top publisher for 2... Read More »
Go to: Ask Encyclopedia · Images · Videos
Browse Article: History · Studios · Notable games published · Upcoming games · References ·
Source: Wikipedia
Related Questions:
     •
     Who was the Video game publisher of LOOM?
     •
     Who is developing the games for Activision and what have they done in the past? We hear the  handheld versions of the game are different than the console versions. Care to enlighten us?
     •
     This game was created by "Activision" for the "Atari 2600". Up to four players could play at one time. Which one was it?
     View more Q&A »

     www.giantbomb.com/activision/3010-78/

     Oct 9, 2014 ... Activision is the largest third-party publisher in the world. It became the first third- party developer for video game consoles, and is responsible ...

      Explore More Answers About

     Source: www.kgbanswers.com

     About · Privacy · Terms · Careers · Ask Blog · Q&A · Mobile · Help · Feedback © 2014 Ask.com
     **truncated

=============================================== ==============================

There is also a second script that extracts all the links from the HTML files above and puts them into another file. Here is the script:

=============================================== ==============================

  use lib '/Users/lialin/perl5/lib/perl5';
          use strict; use warnings;
          use feature 'say';
     use File::Slurp 'slurp';  # makes it easy to read files.
     use Mojo;
     use Mojo::UserAgent;
     use URI;
     use File::Path qw(make_path remove_tree);


     #my $html_file = shift @ARGV; # take file from command lin

     my $calls_dir = "Ask/";
     opendir(my $search_dir, $calls_dir) or die "$!\n";
     my @html_files = grep /\.html$/i, readdir $search_dir;
     closedir $search_dir;
     #print "Got ", scalar @files, " files\n";

     #my %seen = ();
     foreach my $html_files (@html_files) {
        my %seen = ();
        my $current_file = $calls_dir . $html_files;
        open my $FILE, '<', $current_file or die "$html_files: $!\n";

     my $dom = Mojo::DOM->new(scalar slurp $calls_dir .$html_files);
     print $calls_dir .$html_files ;

     #for my $csshref ($dom->find('a[href]')->attr('href')->each) {
     #for my $link ($dom->find('a[href]')->attr('href')->each) {
     #  print $1;
     #say $1 #if $link->attr('href') =~ m{^https?://(.+?)/index\.php}s;
     make_path('Ask/Parsed/Html/');
     open my $fh, '>', "Ask/Parsed/Html/${html_files}.result.txt" or die $!;
     for my $csshref ($dom->find('a[href]')->attr('href')->each) {
     my $cssurl = URI->new($csshref)->abs($calls_dir .$html_files);

     #open my $fh, '>', "Ask/${html_files}.result.txt" or die $!;
     $fh->print("$html_files\n");
     $fh->print("$cssurl\n");
     #$fh->print("\t"."$_\n");
     #print "$cssurl\n";
     #print $file."\t"."$_\n";}}

=============================================== =====

The resulting file looks like this (again using Activision as the example):

=============================================== ==============================

    Activision_ask.html
     http://www.ask.com/answers/browse?     qsrc=167&q=Activision+video+game&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     http://www.ask.com/answers/browse?qsrc=167&q=Activision+video+game&o=0&l=dir#opensignin
     Activision_ask.html
     http://www.ask.com/answers/profile?qsrc=3099
     Activision_ask.html
     http://www.ask.com/answers/profile?qsrc=3099
     Activision_ask.html
     javascript:void(0);
     Activision_ask.html
     http://www.ask.com/advancedsearch?     qsrc=167&q=Activision+video+game&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     http://www.ask.com/?o=0&l=dir&qsrc=14137
     Activision_ask.html
     http://www.ask.com/pictures?q=Activision+video+game&qsrc=167&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     http://www.ask.com/news?q=Activision+video+game&qsrc=167&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     http://www.ask.com/youtube?q=Activision+video+game&qsrc=167&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     http://www.ask.com/shopping?q=Activision+video+game&qsrc=167&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     javascript:void(0);
     Activision_ask.html
     http://www.ask.com/maps?q=Activision+video+game&qsrc=167&qo=channelNavigation&o=0&l=dir
     Activision_ask.html
     javascript:void(0);
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Cheats&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Tester&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Create+Your+Own+Video+Games&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=First+Video+Game+Invented&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Design&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Wii&qsrc=466&o=0&l=dir&qo=relatedSearchExpand
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Designer+Career&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Companies&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Spider-man+3+Video+Game&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Walkthroughs&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Video+Game+Statistics&qsrc=466&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/web?q=Call+of+Duty+4&qsrc=466&o=0&l=dir&qo=relatedSearchExpand
     Activision_ask.html
     http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-     keywords=activision&x=0&y=0&tag=askcom05-20
     Activision_ask.html
     http://www.amazon.com/Activision-Anthology-PlayStation-  2/dp/B00006Z7HQ%3Fpsc%3D1%26SubscriptionId%3D06KMPSHEDSXXQMQVT482%26tag%3Daskcom05-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB00006Z7HQ
Activision_ask.html
http://www.amazon.com/Activision-Anthology-PlayStation-2/dp/B00006Z7HQ%3Fpsc%3D1%26SubscriptionId%3D06KMPSHEDSXXQMQVT482%26tag%3Daskcom05-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB00006Z7HQ
     Activision_ask.html
     http://www.amazon.com/Destiny-Xbox-360/dp/B002I096Q4%3Fpsc%3D1%26SubscriptionId%3D06KMPSHEDSXXQMQVT482%26tag%3Daskcom05-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB002I096Q4
     Activision_ask.html
     http://www.amazon.com/Destiny-Xbox-360/dp/B002I096Q4%3Fpsc%3D1%26SubscriptionId%3D06KMPSHEDSXXQMQVT482%26tag%3Daskcom05-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB002I096Q4
     Activision_ask.html
     http://www.amazon.com/Skylanders-Trap-Team-Not-Machine-Specific/dp/B00NCA6ZT0%3Fpsc%3D1%26SubscriptionId%3D06KMPSHEDSXXQMQVT482%26tag%3Daskcom05-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB00NCA6ZT0
     Activision_ask.html
     http://www.amazon.com/Skylanders-Trap-Team-Not-Machine-Specific/dp/B00NCA6ZT0%3Fpsc%3D1%26SubscriptionId%3D06KMPSHEDSXXQMQVT482%26tag%3Daskcom05-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB00NCA6ZT0
     Activision_ask.html
     http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&field-keywords=activision&x=0&y=0&tag=askcom05-20
     Activision_ask.html
     http://www.ask.com/wiki/Activision
     Activision_ask.html
     http://www.ask.com/wiki/Activision
     Activision_ask.html
     http://en.wikipedia.org/wiki/File:Activision.svg
     Activision_ask.html
     http://www.ask.com/allabout?q=video%20game%20publisher&qsrc=470
     Activision_ask.html
     http://www.ask.com/allabout?q=video%20game%20console&qsrc=470
     Activision_ask.html
     http://www.ask.com/allabout?q=Atari%202600&qsrc=470
     Activision_ask.html
     http://www.ask.com/wiki/Activision
     Activision_ask.html
     http://www.ask.com/wiki/Activision#Upcoming_games
     Activision_ask.html
     http://www.ask.com/wiki/Activision#References
     Activision_ask.html
     http://en.wikipedia.org/wiki/Activision
     Activision_ask.html
     http://www.ask.com/web?q=Who+was+the+Video+game+publisher+of+LOOM%3F&qsrc=469&o=0&l=dir&qo=relatedQuestions
     Activision_ask.html
     http://www.ask.com/web?q=Activision+video+game&qsrc=3060&o=0&l=dir
     Activision_ask.html
     http://www.activision.com/
     Activision_ask.html
     http://www.activision.com/games
     Activision_ask.html
     http://clk.about.com?zi=13/1tO&ity=boostOrg&o=0&ldid=4451&eng=boost&zu=http://vgstrategies.about.com/od/gameboycheatscodes/a/Activision-Anthology.htm
     http://www.gametrailers.com/company/pou3yf/activision
     Activision_ask.html
     http://www.cnbc.com/id/102026893
     Activision_ask.html
     http://www.giantbomb.com/activision/3010-78/
     Activision_ask.html
     http://www.ask.com/web?q=History+of+Video+Game+Systems&qsrc=467&o=0&l=dir&qo=relatedSearchNarrow
     Activision_ask.html
     http://www.ask.com/mobile?&o=0&l=dir&qsrc=0
     Activision_ask.html
     http://help.ask.com
     Activision_ask.html
     http://feedback.ask.com

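=============================================== ==============================

Now I am working on a final script that uses part of the file name plus a string to read the line or lines in the file that contain matching or near-matching text.

In the example above, I am interested in 'http://www.activision.com/games', or basically any line that has 'Activision' from the file name and the word 'game' in it.

My files are obviously very large, and the word 'game' may come before or after the file name.

I hope the explanation and code help others understand what I am trying to achieve.

The problem I have now is the regex used to search for the strings. I am trying to make it less strict, and I cannot get the match to work.

As I mentioned before, I am fairly proficient in HTML and Java, but I know Perl is the right language for this. I am clearly no expert (as you can see from my code above), but I am trying to learn and get my task done.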

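2 answers:

Answer 0: (score: 2)

It is not clear to me what you are trying to do, but given your example file name

PieceIwanttomatch_don't_care_about_this.txt

and assuming you want to find all files that begin with the first seven characters, PieceIw, and also end in .txt, you would write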

if ( /^PieceIw.*\.txt$/ ) { ... }

I hope that helps
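
As a minimal sketch of that test applied to a list of names: the helper `is_wanted` and the sample names other than the first are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern from the answer
sub is_wanted {
    my ($name) = @_;
    return $name =~ /^PieceIw.*\.txt$/ ? 1 : 0;
}

# Made-up names, for illustration only
my @names = (
    "PieceIwanttomatch_don't_care_about_this.txt",
    'PieceIw_notes.html',
    'Other_file.txt',
);

for my $name (@names) {
    printf "%-45s %s\n", $name, is_wanted($name) ? 'match' : 'no match';
}
```

Only the first name both starts with PieceIw and ends in .txt, so it is the only one reported as a match.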


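Update

OK, I think what you want is to search all the .txt files in a directory for lines that contain both the first N characters of the file's name and some other specified string.

If you don't know which will appear first, the file name prefix or the other string, then you were on the right lines with two look-aheads. One improvement is to enclose the strings in \Q...\E, which escapes all non-word characters so that any regex metacharacters cannot disrupt the pattern.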
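
To make the effect of the look-aheads and of \Q...\E concrete, here is a small sketch. The values of `$prefix` and `$wanted`, the test strings, and the helper `line_matches` are assumptions chosen for illustration; the prefix deliberately contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical values: a prefix containing regex metacharacters
my $prefix = 'a.b+c';
my $wanted = 'game';

# Order-independent test: each look-ahead scans the line from the start,
# and \Q...\E makes the interpolated strings match literally
sub line_matches {
    my ($line) = @_;
    return $line =~ /^(?=.*\Q$prefix\E)(?=.*\Q$wanted\E)/ ? 1 : 0;
}

print line_matches('a.b+c has a game'), "\n";   # prefix first, then wanted
print line_matches('game for a.b+c'),   "\n";   # wanted first, then prefix
print line_matches('aXbbc has a game'), "\n";   # would match if \Q...\E were omitted
```

Without \Q...\E the prefix would be read as "a, any character, one or more b, then c", which is why the third string is a useful check.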

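Also note the following

  • I have used autodie, as I explained in my answer to your previous question. If you are running a version of Perl earlier than v5.10 and cannot upgrade, you won't be able to do this and will have to check the status of each file operation individually

  • It is important to use absolute paths for the directories; otherwise the user has to make sure they are in the correct current working directory before running the program

  • I have put all the parameters to the program (the two directories and the additional string to search for) as definitions at the top of the program

  • I have used glob instead of opendir / readdir / grep, because it is tidier and the resulting file names include the full path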

use strict;
use warnings;
use 5.010;
use autodie;

use File::Path qw/ make_path remove_tree /;
use File::Basename qw/ fileparse /;

my $calls_dir  = '/path/to/Ask/Parsed/Html';
my $parsed_dir = '/path/to/Ask/Parsed/Html2';
my $wanted     = 'game';

my @files = glob "$calls_dir/*.txt";

printf "Got %d files\n", scalar @files;

for my $file (@files) {

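  open my $in_fh, '<', $file;

  # glob returns full paths, so take the prefix from the base name
  my $basename = fileparse($file);
  my $prefix   = substr $basename, 0, 7;
  print $prefix, "\n";

  make_path($parsed_dir);
  open my $out_fh, '>', "$parsed_dir/${basename}_parsed_for_management.txt";

  while (<$in_fh>) {
    # two look-aheads, so the strings may appear in either order; /i ignores case
    print $out_fh $_ if /^(?=.*\Q$prefix\E)(?=.*\Q$wanted\E)/i;
  }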

  close $out_fh;
}
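
Because glob returns each name with its directory attached, the prefix has to come from the base name rather than from the whole path. A minimal sketch of that one step in isolation, using a made-up path and a hypothetical helper `name_prefix`:

```perl
use strict;
use warnings;
use File::Basename qw/ fileparse /;

# Hypothetical helper: first $len characters of the file name,
# ignoring the directory part of the path
sub name_prefix {
    my ($path, $len) = @_;
    my ($basename) = fileparse($path);
    return substr $basename, 0, $len;
}

my $path = '/path/to/Ask/Parsed/Html/Nintendo_ask.html.result.txt';  # made-up path

print substr($path, 0, 7), "\n";     # taken from the full path: part of the directory
print name_prefix($path, 7), "\n";   # taken from the base name: part of the file name
```

The list assignment `my ($basename) = fileparse(...)` picks the first returned element, which is the file name without its directory.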

Update

This works fine

my ($wanted, $prefix) = qw/ game nintendo /;

for ( 'game.nintendo.com/phoenix.zhtml?c=121127&p=irol-gom' ) {
  print "OK\n" if / \Q$wanted\E .* \Q$prefix\E /x;
}

Output

OK
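
Note that this pattern fixes the order: `$wanted` has to come before `$prefix`. The sketch below contrasts it with one way of accepting either order through alternation. The helpers `fixed_order` and `either_order` and the second test string are made up for illustration.

```perl
use strict;
use warnings;

my ($wanted, $prefix) = qw/ game nintendo /;

# Fixed order, as in the snippet above: $wanted first, then $prefix
sub fixed_order {
    my ($line) = @_;
    return $line =~ / \Q$wanted\E .* \Q$prefix\E /x ? 1 : 0;
}

# Hypothetical helper: accept the two strings in either order
sub either_order {
    my ($line) = @_;
    return $line =~ / \Q$wanted\E .* \Q$prefix\E | \Q$prefix\E .* \Q$wanted\E /x ? 1 : 0;
}

for my $line (
    'game.nintendo.com/phoenix.zhtml?c=121127&p=irol-gom',
    'www.nintendo.com/games',    # made-up example with the order reversed
) {
    printf "%-55s fixed=%d either=%d\n", $line, fixed_order($line), either_order($line);
}
```

Both patterns are case-sensitive, so a capitalised prefix taken from a file name would also need the /i modifier to match a lowercase URL.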

Answer 1: (score: 0)

Somewhat speculative, trying to read between the lines here.

opendir(my $search_dir, $calls_dir) or die "$!\n";
my @files = grep /^${prefix}_/, grep /\.txt$/i, readdir $search_dir;
closedir $search_dir;

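Now @files contains only the files whose names begin with $prefix followed by an underscore and end in .txt. You don't want to search any files other than these. I am speculating about the underscore, but you can modify it to better suit your needs if that is not the case.

Now, search (only) the matching files.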
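
A small sketch of how the two chained grep filters narrow a listing. The value of `$prefix` and the names in `@listing` are made up and stand in for what readdir would return.

```perl
use strict;
use warnings;

my $prefix = 'Nintendo';   # assumed prefix, for illustration

# Made-up directory listing standing in for readdir's result
my @listing = qw(
    Nintendo_ask.html.result.txt
    Nintendo_ask.html
    Atari_ask.html.result.txt
    NintendoDS_notes.txt
);

# The inner grep keeps names ending in .txt; the outer grep then keeps
# those that start with the prefix followed by an underscore
my @files = grep { /^\Q$prefix\E_/ } grep { /\.txt$/i } @listing;

print "$_\n" for @files;
```

The underscore in the pattern is what excludes NintendoDS_notes.txt, which shares the prefix but continues with a different character.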

for my $file (@files) {
  my $current_file = $calls_dir . $file;
  open my $FILE, '<', $current_file or die "$file: $!\n";    
  while (<$FILE>) {
      print "$file\n$_" if m/management/;
  }
}

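I would actually suggest using a tab or colon separator, rather than a newline, between the file name and the matching line. Line-oriented output is much easier to work with.

Of course, all of this is just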
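
As a sketch of the tab-separated variant: the helper `format_match`, the file name and the input lines below are hypothetical and only illustrate the output shape.

```perl
use strict;
use warnings;

# Hypothetical helper: one output record per matching line, "file<TAB>line"
sub format_match {
    my ($file, $line) = @_;
    chomp $line;
    return "$file\t$line\n";
}

# Made-up input lines standing in for a file's contents
my @lines = ("About our management team\n", "Contact us\n");

for my $line (@lines) {
    print format_match('Nintendo_ask.html.result.txt', $line) if $line =~ /management/;
}
```

With one record per line, the output can be piped straight into tools such as cut or sort.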

grep management "$prefix"_*.txt >output

as a one-line shell script.