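Search files in a directory based on keywords from another file

Time: 2013-06-06 03:54:07

Tags: linux perl unix

Perl newbie here, looking for some help.

I have a directory of files and a "keywords" file that lists the attributes to search for and the type of each attribute.

For example: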

Keywords.txt

Attribute1 boolean
Attribute2 boolean
Attribute3 search_and_extract
Attribute4 chunk

For each file in the directory, I have to:

  • look up keywords.txt
  • search according to the attribute type

Something like the following.

IF attribute_type = boolean THEN
 search for attribute;
 set found = Y if attribute found;
ELSIF attribute_type = search_and_extract THEN
 extract string where attribute is Found
ELSIF attribute_type = chunk THEN
 extract the complete chunk of paragraph where attribute is found.

This is what I have so far, and I'm sure there is a more efficient way to do it.

I'm hoping someone can point me in the right direction. Thanks & regards, Sima

# Reads attributes from config file
# First set boolean attributes. If the keyword is found in the text,
# the flag variable is set to Y, else N
# End Code: loop over each text file in the directory.
# Run the below for each document.

use strict;
use warnings;

# open the document
open(my $doc_fh, '<', 'Final_CLP.txt') or die "Cannot open Final_CLP.txt: $!";
while (my $doc_line = <$doc_fh>) {
    chomp $doc_line;
    # open the config file (note: re-read for every line of the document)
    open(my $cfg_fh, '<', 'attribute_config.txt')
        or die "Cannot open attribute_config.txt: $!";
    while (my $cfg_line = <$cfg_fh>) {
        chomp $cfg_line;
        my ($attribute, $attribute_type) = split /\t/, $cfg_line;

        # "Y" for boolean attributes, "N" for all other types
        my $is_boolean = ($attribute_type eq "boolean") ? "Y" : "N";

        # For each boolean attribute, check if the keyword exists
        # in the file and return Y or N
        if ($is_boolean eq "Y") {
            print "Yes\n";
            # search for keyword in doc and assign values
        }

        print "Attribute: $attribute\n";
        print "Attribute_Type: $attribute_type\n";
        print "is_boolean: $is_boolean\n";
        print "-----------\n";
    }
    close($cfg_fh);
}
close($doc_fh);
exit;

1 answer:

Answer 0 (score: 0)

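Starting your specs/question with a story ("I have ...") is a good idea. But such a story, whether true or made up because you can't disclose the truth, should give

  • a vivid picture of the situation/problem/task
  • the reason(s) why all the work must be done
  • definitions for terms that are not commonly used

So I'd start with: I work in a prison and have to scan the emails of the inmates for

  • names (like "Al Capone") mentioned anywhere in the text; the director wants to read those emails
  • order lines (like "weapon: AK 4711 quantity: 14"); the armory officer wants that information to calculate the amount of ammunition and rack space needed
  • paragraphs containing "family" keywords like "wife", "child", ...; the priest wants to prepare her sermons efficiently

Taken by itself, each of the terms "keyword" (~ running text) and "attribute" (~ structured text) may be "clear", but if both are applied to "the X I have to search for", things get muddled. Instead of general ("chunk") and technical ("string") terms, you should use 'real world' (line) and specific (paragraph) words. A sample of your input: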

From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin

and your expected output:

--- Robin.txt ----
keywords:
  Al Capone: Yes
  Billy the Kid: No
  Scarface: Yes
order lines:
  knife:
    knife: Bowie quantity: 8
  machine gun:
  stinger rocket:
  weapon:
    weapon: AK 4711 quantity: 14
social relations paragaphs:
  Tell my wife in Folsom to send some money to my son in
  Alcatraz.

Pseudocode should start at the top level. If you start with

for each file in folder
    load search list
    process current file('s content) using search list

it becomes obvious that

load search list
for each file in folder
    process current file using search list

would be much better.
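
A rough Perl sketch of that second plan, assuming a tab-separated config file like the one in the question. The sub names `load_search_list` and `process_file` are made up for illustration, and `process_file` is only a stub; the script expects the config file and the folder as its two command-line arguments.

```perl
use strict;
use warnings;

# Read "attribute<TAB>type" pairs once and return them as a list of hashes.
# Assumption: one tab-separated pair per line, as in the question's config file.
sub load_search_list {
    my ($config_path) = @_;
    open(my $fh, '<', $config_path) or die "Cannot open $config_path: $!";
    my @items;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip blank lines
        my ($attribute, $type) = split /\t/, $line, 2;
        push @items, { attribute => $attribute, type => $type };
    }
    close $fh;
    return @items;
}

# Stub (hypothetical): the real per-file search would go here.
sub process_file {
    my ($path, @search_list) = @_;
    printf "--- %s --- (%d search items)\n", $path, scalar @search_list;
    return;
}

# Top level: load the search list once, then loop over the folder.
sub main {
    my ($config_path, $folder) = @_;
    my @search_list = load_search_list($config_path);
    opendir(my $dh, $folder) or die "Cannot open folder $folder: $!";
    my @names = sort grep { /\.txt\z/ } readdir $dh;
    closedir $dh;
    process_file("$folder/$_", @search_list) for @names;
    return 0;
}

# Runs only when called as: perl script.pl CONFIG_FILE FOLDER
exit main(@ARGV) if @ARGV == 2;
```

The config file is opened exactly once, no matter how many files or lines the folder contains.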

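Based on this story, the example, and the top-level plan, I would try to come up with proof-of-concept code for a simplified version of the "process current file('s content) using search list" task: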

given file/text to search in and list of keywords/attributes

print file name
print "keywords:"
for each boolean item
  print boolean item text
  if found anywhere in whole text
     print "Yes"
  else
     print "No"
print "order line:"
for each line item
  print line item text
  if found anywhere in whole text
     print whole line
print "social relations paragaphs:"
for each paragraph
    for each social relation item
        if found
           print paragraph
           no need to check for other items

First implementation attempt:

use Modern::Perl;

#use English qw(-no_match_vars);
use English;

exit step_00();

sub step_00 {
  # given file/text to search in
  my $whole_text = <<"EOT";
From: Robin Hood
To: Scarface

Hi Scarface,

tell Al Capone to send a car to the prison gate on sunday.

For the riot we need:

weapon: AK 4711 quantity: 14
knife: Bowie quantity: 8

Tell my wife in Folsom to send some money to my son in
Alcatraz.

Regards
Robin
EOT

  #  print file name
  say "--- Robin.txt ---";
  # print "keywords:"
  say "keywords:";
  # for each boolean item
  for my $bi ("Al Capone", "Billy the Kid", "Scarface") {
  #   print boolean item text
      printf " %s: ", $bi;
  #   if found anywhere in whole text
      if ($whole_text =~ /$bi/) {
  #      print "Yes"
         say "Yes";
  #   else
      } else {
  #      print "No"
         say "No";
      }
  }
  # print "order line:"
  say "order lines:";
  # for each line item
  for my $li ("knife", "machine gun", "stinger rocket", "weapon") {
  #   print line item text
  #   if found anywhere in whole text
      if ($whole_text =~ /^$li.*$/m) {
  #      print whole line
         say " ", $MATCH;
      }
  }
  # print "social relations paragaphs:"
  say "social relations paragaphs:";
  # for each paragraph
  for my $para (split /\n\n/, $whole_text) {
  #     for each social relation item
        for my $sr ("wife", "son", "husband") {
  #         if found
            if ($para =~ /$sr/) {
        ##  if ($para =~ /\b$sr\b/) {
  #            print paragraph
               say $para;
  #            no need to check for other items
               last;
            }
        }
  }
  return 0;
}

Output:

perl 16953439.pl
--- Robin.txt ---
keywords:
 Al Capone: Yes
 Billy the Kid: No
 Scarface: Yes
order lines:
 knife: Bowie quantity: 8
 weapon: AK 4711 quantity: 14
social relations paragaphs:
tell Al Capone to send a car to the prison gate on sunday.
Tell my wife in Folsom to send some money to my son in
Alcatraz.

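Such (premature) code helps you to

  • clarify your specs (should keywords that were not found go into the output? is your search list really flat, or should it be structured/grouped?)
  • check your assumptions about how to do things (should the order line search be done on an array of the lines of the whole text?)
  • identify topics for further research/RTFM (e.g. regex (prison!))
  • plan your next steps (folder loop, reading the input file)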
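
The "prison!" hint refers to the first paragraph in the output above: the plain pattern `/son/` also matches inside "prison". The commented-out line in the code already points at word boundaries; a minimal sketch of that variant follows. The helper name `mentions_any` is hypothetical, and `\Q...\E` is added on the assumption that the search terms are literal text rather than regexes.

```perl
use strict;
use warnings;

# Return 1 if any of the given terms occurs in $text as a whole word, else 0.
# \b anchors the match at word boundaries; \Q...\E quotes regex metacharacters.
sub mentions_any {
    my ($text, @terms) = @_;
    for my $term (@terms) {
        return 1 if $text =~ /\b\Q$term\E\b/;
    }
    return 0;
}

my @relations = ("wife", "son", "husband");
my @paragraphs = (
    "tell Al Capone to send a car to the prison gate on sunday.",
    "Tell my wife in Folsom to send some money to my son in\nAlcatraz.",
);
for my $para (@paragraphs) {
    # Only paragraphs that mention a relation as a whole word are printed.
    print "$para\n" if mentions_any($para, @relations);
}
```

The trade-off is that a whole-word match is strict: "sons" or "wife's" would need their own entries or a looser pattern.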

(In addition, people in the know will point out all my bad practices, so you can avoid them from the start.)

Good luck!
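
On the question of searching for order lines in an array of lines, here is one possible sketch. It assumes an order line starts with the item name followed by a colon, as in the sample mail; the sub name `order_lines` is made up for illustration. It prints every item name and lists the matching lines beneath it, following the layout of the expected output.

```perl
use strict;
use warnings;

# Map each item name to the lines of $text that start with "<item>:".
# Assumption: an order line begins with the item name followed by a colon.
sub order_lines {
    my ($text, @items) = @_;
    my @lines = split /\n/, $text;
    my %found;
    for my $item (@items) {
        $found{$item} = [ grep { /^\Q$item\E:/ } @lines ];
    }
    return %found;
}

my $mail = "For the riot we need:\n\n"
         . "weapon: AK 4711 quantity: 14\n"
         . "knife: Bowie quantity: 8\n";
my @order_items = ("knife", "machine gun", "stinger rocket", "weapon");
my %found = order_lines($mail, @order_items);

print "order lines:\n";
for my $item (@order_items) {
    print "  $item:\n";
    print "    $_\n" for @{ $found{$item} };
}
```

Unlike the single match against the whole text in the first attempt, which stops at the first hit, this collects every line for an item.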