I have a file (lookup.txt) containing a lookup table: a list of regular expressions with the corresponding data (category and period), e.g.:
INTERNODE|household/bills/broadband|monthly
ORIGIN ENERGY|household/bills/electricity|quarterly
TELSTRA.*BILL|household/bills/phone|quarterly
OPTUS|household/bills/mobile|quarterly
SKYPE|household/bills/skype|non-periodic
I have another file (data.txt) containing a list of expenses, e.g.:
2009-10-31,cc,-39.9,INTERNODE BROADBAND
2009-10-31,cc,-50,ORIGIN ENERGY 543546
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES
2009-10-31,cc,-90,TELSTRA MOBILE BILL
2009-11-02,cc,-320,TELSTRA HOME BILL
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
I would like to join the two, matching the 4th column of data.txt against the regular expressions in the first column of lookup.txt.
So the output would be:
2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
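I have achieved this with a bash loop that iterates over the lookup table, runs grep for each entry, and uses sed to append the extra columns, but it is very slow. So I am wondering whether there is a faster way to do this, say using awk.
Any help would be much appreciated.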
Answer 0 (score: 3)
$ awk -F'|' 'FNR==NR{a[$1]=$2","$3;next}{m=split($0,b,",");for(i in a){if(b[4]~i){print $0","a[i];next}}}1' lookup file
2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
Answer 1 (score: 1)
You can do this in Python:
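#!/usr/bin/env python3
import csv
import re

# Compile each lookup pattern once, keeping its extra columns alongside it.
lookup = []
with open('lookup.txt', newline='') as f:
    for rec in csv.reader(f, delimiter='|'):
        lookup.append((re.compile(rec[0]), rec[1:]))

with open('data.txt', newline='') as f:
    for rec in csv.reader(f, delimiter=','):
        for rexp, fields in lookup:
            # search() matches anywhere in the field, like grep;
            # match() would only match at the start of the field.
            if rexp.search(rec[3]):
                rec.extend(fields)
                break
        print(','.join(rec))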
For your lookup.txt and data.txt files, it returns the following in under 0.3 seconds:
2009-10-31,cc,-39.9,INTERNODE BROADBAND,household/bills/broadband,monthly
2009-10-31,cc,-50,ORIGIN ENERGY 543546,household/bills/electricity,quarterly
2009-10-31,cc,-68,INTERNODE BROADBAND EXCESS CHARGES,household/bills/broadband,monthly
2009-10-31,cc,-90,TELSTRA MOBILE BILL,household/bills/phone,quarterly
2009-11-02,cc,-320,TELSTRA HOME BILL,household/bills/phone,quarterly
2009-11-03,cc,-22.96,DICK SMITH
2009-11-03,cc,-251.24,BUNNINGS
2009-11-04,cc,-4.2,7-ELEVEN
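Answer 2 (score: 0)
You can do this in Perl. The advantage of Perl (or Python) is that they have libraries for handling CSV files. Your example is simple, but what if you had a comma inside double quotes? Or what about UTF-8? And so on.
The standard Perl library is Text::CSV_XS. However, it is a bit verbose; I prefer Parse::CSV, which is a wrapper around Text::CSV_XS.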
#!/usr/bin/perl
use strict;
use warnings;
use Parse::CSV;

my %lookup;
my $l = Parse::CSV->new(file => "lookup.txt", sep_char => '|');
while (my $row = $l->fetch) {
    my $key = qr/$row->[0]/;
    # Store every column after the pattern (category and period).
    $lookup{$key} = [ @{$row}[1 .. $#{$row}] ];
}

my $d = Parse::CSV->new(file => "data.txt");
while (my $row = $d->fetch) {
    # Append the extra columns of the first pattern that matches column 4.
    foreach my $regex (keys %lookup) {
        if ($row->[3] =~ $regex) {
            push @$row, @{$lookup{$regex}};
            last;
        }
    }
    print join(",", @$row), "\n";
}
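To illustrate the point about commas inside double quotes, here is a minimal Python sketch. The record below is hypothetical (it is not from the asker's data); it only shows that naive splitting on commas breaks a quoted field, while a CSV library keeps it intact.

```python
import csv
import io

# Hypothetical record: the description field contains a comma inside quotes.
line = '2009-11-05,cc,-12.5,"SMITH, JONES AND CO"\n'

# Naive splitting cuts the quoted field into two pieces.
naive = line.rstrip("\n").split(",")

# csv.reader honours the quoting and keeps the field whole.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive))   # 5
print(parsed[3])    # SMITH, JONES AND CO
```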
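Answer 3 (score: 0)
If you had no regular expressions, you could use join. How many regular expressions does lookup.txt contain? If it is just that one, simply expand it and drop that feature.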
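As a rough sketch of what a plain (non-regex) join would look like, the Python below uses a dictionary keyed on the exact text of column 4. This is an assumption-laden illustration, not the join command itself: the function name join_exact and the sample rows are made up, and it only works when the description equals the lookup key exactly.

```python
import csv
import io

def join_exact(lookup_lines, data_lines):
    """Append the lookup columns when column 4 equals a lookup key exactly."""
    table = {}
    for key, *rest in csv.reader(lookup_lines, delimiter="|"):
        table[key] = rest
    out = []
    for rec in csv.reader(data_lines):
        # Rows without a matching key are passed through unchanged.
        out.append(",".join(rec + table.get(rec[3], [])))
    return out

# Hypothetical sample input, held in memory instead of files.
lookup = io.StringIO("OPTUS|household/bills/mobile|quarterly\n")
data = io.StringIO("2009-11-05,cc,-30,OPTUS\n2009-11-05,cc,-4.2,7-ELEVEN\n")
for line in join_exact(lookup, data):
    print(line)
```

The dictionary lookup is a single hash probe per record, which is why dropping the regular expressions would make the join much cheaper.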
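Answer 4 (score: 0)
Awk is really designed to process a single data stream one record at a time, so it is not a good fit for this job. It would be a ten-minute exercise in Perl or another language more geared toward general-purpose programming.
If you are set on doing it all in awk, write a script that generates a second awk script from the lookup file to process the data, then run that second script.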
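One possible shape for the generator step is sketched below in Python. The function name make_awk_script is hypothetical, and the sketch assumes that no pattern contains "/" or "|" and that no field contains a double quote or backslash; real data would need escaping.

```python
def make_awk_script(lookup_lines):
    """Build awk source with one pattern-action rule per lookup entry."""
    rules = ['BEGIN { FS = "," }']
    for line in lookup_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        pattern, category, period = line.split("|")
        # Assumption: nothing here needs escaping for awk (see lead-in).
        rules.append(
            '$4 ~ /%s/ { print $0 ",%s,%s"; next }' % (pattern, category, period)
        )
    # Records that match no rule are printed unchanged.
    rules.append("{ print }")
    return "\n".join(rules) + "\n"

print(make_awk_script(["INTERNODE|household/bills/broadband|monthly\n"]), end="")
```

The generated text could be saved to a file (say generated.awk, a made-up name) and run with awk -f generated.awk data.txt. Because the rules are emitted in lookup order and each ends with next, the first matching pattern wins.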