Unix使用aw​​k连接两个带正则表达式的文件

时间:2011-04-06 11:30:07

标签: regex bash unix awk

我有一个文件(lookup.txt),其中包含一个由正则表达式列表组成的查找表,以及相应的数据(类别和句点)。 e.g。

INTERNODE|household/bills/broadband|monthly
ORIGIN ENERGY|household/bills/electricity|quarterly
TELSTRA.*BILL|household/bills/phone|quarterly
OPTUS|household/bills/mobile|quarterly
SKYPE|household/bills/skype|non-periodic

我有另一个文件(data.txt),其中包含费用清单,例如:

2009-10-31,cc,-39.9,INTERNODE BROADBAND
2009-10-31,cc,-50,ORIGIN ENERGY 543546
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES
2009-10-31,cc,-90,TELSTRA MOBILE BILL
2009-11-02,cc,-320,TELSTRA HOME BILL
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

我想将这两者结合在一起,其中data.txt文件中的第4列与lookup.txt文件的第一列中的正则表达式匹配。

所以输出结果为:

2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

我使用bash循环实现了这一点,循环查找,执行greps并使用sed添加额外的列,但速度非常慢。所以想知道是否有更快的方法来做这个,说使用awk。

任何帮助都将不胜感激。

5 个答案:

答案 0 :(得分:3)

$ awk -F'|' 'FNR==NR{a[$1]=$2","$3;next}{m=split($0,b,",");for(i in a){if(b[4]~i){print $0","a[i];next}}}1' lookup file
2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

答案 1 :(得分:1)

你可以用Python做到这一点:

#!/usr/bin/python
import csv, re
lookup = []
with open('lookup.txt') as f:
    for rec in csv.reader(f, delimiter='|'):
        lookup.append((re.compile(rec[0]), rec[1:]))
with open('data.txt') as f:
    for rec in csv.reader(f, delimiter=','):
        for rexp, fields in lookup:
            if rexp.match(rec[3]):
                rec.extend(fields)
                break
        print ','.join(rec)

对于您的文件lookup.txtdata.txt,它会在不到0.3秒的时间内返回以下内容:

2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN

答案 2 :(得分:0)

你可以在Perl中完成。 Perl(或Python)的优点是它们具有用于处理CSV文件的库。您的示例很简单,但如果您在双引号内有逗号会怎样?或者utf8怎么样?等

标准的Perl库是Text:CSV_XS。但是,它有点冗长,我更喜欢Parse::CSV,它是Text :: CSV_XS的包装。

#!/usr/bin/perl

use strict;
use warnings;
use Parse::CSV;

my %lookup;
my $l = Parse::CSV->new(file => "lookup.txt", sep_char => '|');
while (my $row = $l->fetch) {
   my $key = qr/$row->[0]/;
   $lookup{$key} = [$row->[1,]];
}

my $d = Parse::CSV->new(file => "data.txt");
while (my $row = $d->fetch) {
   foreach my $regex (keys %lookup) {
      if ($row->[3] =~ $regex) {
         push @$row, @{$lookup{$regex}};
         last;
      }
   }
   print join(",", @$row), "\n";
}

答案 3 :(得分:0)

如果您没有正则表达式,可以使用joinlookup.txt有多少个正则表达式?如果只是那个,只需展开它并删除该功能。

答案 4 :(得分:0)

Awk实际上是设计为一次处理一个记录的单个数据流,因此它不适合这项工作。这将是Perl或其他语言的十分钟练习,更倾向于通用编程。

如果您只是想在awk中完成所有操作,请编写一个脚本以从查找文件生成第二个awk脚本来处理数据,然后运行第二个脚本。