I have a file (file.txt) containing a list of words:
apple
banana
cherry
orange
pineapples
I also have a CSV file (data.csv) with a large amount of data:
1,"tasty apples",3,5
23,"iphone app",5,12
1,"sour grapes",3,5
23,"banana apple smoothie",5,12
1,"cherries and orange shortage",3,5
23,"apple iphone orange cover",5,12
3,"pineapple cherry bubble gum",13,5
5,"pineapples are best frozen",22,33
I want to append the words from file.txt that match each row as an extra last field, producing output.csv:
1,"tasty apples",3,5,""
23,"iphone app",5,12,""
1,"sour grapes",3,5,""
23,"banana apple smoothie",5,12,"apple+banana"
1,"cherries and orange shortage",3,5,"orange"
23,"apple iphone orange cover",5,12,"apple+orange"
3,"pineapple cherry bubble gum",13,5,"cherry"
5,"pineapples are best frozen",22,33,"pineapples"
I can do this with grep, but to do so I have to use a while loop with an if statement and process the text files line by line.
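The problem is that file.txt has about 500 lines and data.csv has 330,000 lines. My script works, but it could take days to finish.
Is there a more efficient way to do this than my approach?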
Answer 0: (score: 2)
Perl to the rescue!
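#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS qw{ csv };

# Read the word list, one word per line.
open my $f1, '<', 'file.txt' or die $!;
my @fruits;
chomp, push @fruits, $_ while <$f1>;

# Remember each word's position in file.txt so matches can be sorted in that order.
my %order;
@order{@fruits} = 0 .. $#fruits;

# One alternation of all words, longest first.
my $regex = join '|', sort { length $b <=> length $a } @fruits;

csv(
    in    => 'data.csv',
    eol   => "\n",
    on_in => sub {
        # $_[1] is the current row; its second field holds the text to search.
        my @matches;
        push @matches, $1 while $_[1][1] =~ /\b($regex)\b/g;
        # Append the whole-word matches, joined by "+", as a new last field.
        push @{ $_[1] }, join '+',
            sort { $order{$a} <=> $order{$b} }
            @matches;
    },
);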
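For readers who do not use Perl, the same idea can be sketched in Python: build one alternation regex from all the words, scan each description once, and order the hits by their position in file.txt. This is a minimal sketch, not a port of the script above. The function names `build_matcher` and `annotate` are made up for illustration, the sample rows are taken from the question, and it assumes the text to search is always in the second column.

```python
import csv
import io
import re


def build_matcher(words):
    """Compile a single whole-word regex for all words; return it with a rank table."""
    order = {word: rank for rank, word in enumerate(words)}
    # Longest first, mirroring the Perl sort; re.escape guards against metacharacters.
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    return re.compile(r"\b(?:%s)\b" % alternation), order


def annotate(rows, words):
    """Yield each row with an extra field listing the matched words joined by '+'."""
    regex, order = build_matcher(words)
    for row in rows:
        hits = regex.findall(row[1])  # second column holds the description
        yield row + ["+".join(sorted(hits, key=order.__getitem__))]


words = ["apple", "banana", "cherry", "orange", "pineapples"]
data = io.StringIO(
    '23,"banana apple smoothie",5,12\n'
    '3,"pineapple cherry bubble gum",13,5\n'
    '1,"tasty apples",3,5\n'
)
result = list(annotate(csv.reader(data), words))
```

The regex is compiled once and each row is scanned once inside a single process, which avoids starting a separate grep for every word and line.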
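Unfortunately, Text::CSV_XS cannot quote the last field unless it contains special characters (or unless all fields are quoted). However, if file.txt contains no double quotes or commas, you can easily add the quotes afterwards: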
perl ... | sed 's/,\([^,"]*\)$/,"\1"/'
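The same post-processing step can be sketched in Python, using the pattern from the sed command unchanged. The helper name `quote_last_field` is hypothetical, and like the sed version it assumes the final field contains no commas or double quotes.

```python
import re

# Same pattern as the sed command: a final field with no commas or double quotes.
LAST_FIELD = re.compile(r',([^,"]*)$')


def quote_last_field(line):
    """Wrap an unquoted final field in double quotes; leave other lines untouched."""
    return LAST_FIELD.sub(r',"\1"', line)


quoted = [quote_last_field(s) for s in (
    '23,"banana apple smoothie",5,12,apple+banana',
    '1,"tasty apples",3,5,',
)]
```

An empty final field becomes `""`, and a line whose last field is already quoted does not match the pattern, so it passes through unchanged.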
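Answer 1: (score: 1)
Is there a reason you want the last field quoted? A "+" has no special meaning in CSV, so it does not need quoting, and neither does an empty field. Text::CSV_XS does support quoting empty fields or quoting all fields, but it does not yet support quoting all non-numeric fields. choroba's answer allows the last field to be "apple+apple+orange" (which may be what is wanted; the OP does not define this clearly). Building on that answer, I would write it like this: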
use 5.14.1;
use warnings;
use Text::CSV_XS qw( csv );

# Slurp the word list from file.txt, one word per line.
chomp (my @fruits = do { local @ARGV = "file.txt"; <> });

my %order;
@order{@fruits} = 0 .. $#fruits;
my $regex = join "|", sort { length $b <=> length $a } @fruits;

csv (
    in          => "data.csv",
    eol         => "\n",
    quote_empty => 1,    # write an empty last field as ""
    on_in       => sub {
        # Collect the matches as hash keys to drop duplicates, then sort by word-list order.
        push @{$_[1]}, join "+" =>
            sort { $order{$a} <=> $order{$b} }
            keys %{{map { $_ => 1 }
                ($_[1][1] =~ m/\b($regex)\b/g)}};
    },
);
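The difference from the first answer is the de-duplication step: each word is reported once even if it occurs several times in the text. A minimal Python sketch of that step, using a set where the Perl code uses hash keys; the function name `unique_matches` and the sample sentence are invented for illustration.

```python
import re


def unique_matches(text, words):
    """Return each listed word found in text once, in word-list order, joined by '+'."""
    order = {word: rank for rank, word in enumerate(words)}
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    hits = set(re.findall(r"\b(?:%s)\b" % alternation, text))  # set drops repeats
    return "+".join(sorted(hits, key=order.__getitem__))


words = ["apple", "banana", "cherry", "orange", "pineapples"]
deduped = unique_matches("apple pie with orange and more apple", words)
```

Here "apple" occurs twice in the input but appears once in the result; without the set, the joined field would repeat it.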