Bash: a more efficient way than grep to process a CSV file

Asked: 2019-03-17 20:42:06

Tags: bash grep

Updated

I have a file (file.txt) containing a list of words:

apple
banana
cherry
orange
pineapples

I have a CSV file (data.csv) with a large amount of data:

1,"tasty apples",3,5
23,"iphone app",5,12
1,"sour grapes",3,5
23,"banana apple smoothie",5,12
1,"cherries and orange shortage",3,5
23,"apple iphone orange cover",5,12
3,"pineapple cherry bubble gum",13,5
5,"pineapples are best frozen",22,33

I want to append the words from file.txt that match each row as an extra field, producing output.csv:

1,"tasty apples",3,5,""
23,"iphone app",5,12,""
1,"sour grapes",3,5,""
23,"banana apple smoothie",5,12,"apple+banana"
1,"cherries and orange shortage",3,5,"orange"
23,"apple iphone orange cover",5,12,"apple+orange"
3,"pineapple cherry bubble gum",13,5,"cherry"
5,"pineapples are best frozen",22,33,"pineapples"

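I can do this with grep, but to do so I have to use a while loop with if statements and process the text file line by line.

The problem is that file.txt has about 500 lines and data.csv has 330,000 lines. My script works, but it could take days to finish.

Is there a more efficient way than my approach?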
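
For reference, the slow approach described above probably looks something like the sketch below. The question does not show the actual script, so the function name annotate_slow and the loop structure are assumptions made for illustration. The point is the cost: one grep process is started for every (row, word) pair, which for 330,000 rows and 500 words is 330,000 × 500 = 165,000,000 process launches.

```shell
# Hypothetical reconstruction of the slow approach; not the asker's actual script.
# Usage: annotate_slow file.txt data.csv > output.csv
annotate_slow() {
    words_file=$1
    data_file=$2
    while IFS= read -r line; do
        matches=""
        # Inner loop: one grep process per word, for every row.
        while IFS= read -r word; do
            # -w: whole-word match, -F: fixed string, -q: quiet
            if printf '%s\n' "$line" | grep -qwF -- "$word"; then
                matches="${matches:+$matches+}$word"
            fi
        done < "$words_file"
        # Append the matches as a quoted last field.
        printf '%s,"%s"\n' "$line" "$matches"
    done < "$data_file"
}
```

Note that this sketch searches the whole row rather than only the quoted text field, which is harmless for the sample data but is another assumption.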

2 answers:

Answer 0 (score: 2)

Perl to the rescue!

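#!/usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS qw{ csv };

# Read the word list, one word per line.
open my $f1, '<', 'file.txt' or die $!;
my @fruits;
chomp, push @fruits, $_ while <$f1>;

# Remember each word's position in file.txt so matches can be sorted in that order.
my %order;
@order{@fruits} = 0 .. $#fruits;

# Build one alternation of all words, longest first.
my $regex = join '|', sort { length $b <=> length $a } @fruits;

# No 'out' is given; the result goes to standard output.
csv(
    in          => 'data.csv',
    eol         => "\n",
    on_in       => sub {
        # $_[1] is the current row; $_[1][1] is its second (text) field.
        my @matches;
        push @matches, $1 while $_[1][1] =~ /\b($regex)\b/g;
        # Append the matched words, joined with '+', as a new last field.
        push @{ $_[1] }, join '+',
                         sort { $order{$a} <=> $order{$b} }
                         @matches;
    },
);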

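Unfortunately, Text::CSV_XS can't quote the last field if it contains no special characters (unless it quotes all fields). However, if file.txt contains no double quotes or commas, you can easily add the quotes afterwards: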

perl ... | sed 's/,\([^,"]*\)$/,"\1"/'
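
To see what that sed expression does in isolation, here is a small sketch. The wrapper function name quote_last_field is made up for this example, and the two input rows are assumed to be what the Perl script prints for a row with matches and a row without any.

```shell
# Hypothetical wrapper around the sed expression from the answer.
# It wraps the last field in double quotes, provided that field
# contains no comma and no double quote.
quote_last_field() {
    sed 's/,\([^,"]*\)$/,"\1"/'
}

printf '%s\n' \
    '23,"banana apple smoothie",5,12,apple+banana' \
    '1,"tasty apples",3,5,' | quote_last_field
# prints:
# 23,"banana apple smoothie",5,12,"apple+banana"
# 1,"tasty apples",3,5,""
```

An empty last field becomes "" because `[^,"]*` also matches the empty string after the final comma.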

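Answer 1 (score: 1)

Is there a reason you want the last field quoted? A "+" has no special meaning in CSV, so it needs no quoting, and neither does an empty field. Text::CSV_XS does support quoting empty fields or quoting all fields, but it does not yet support quoting all non-numeric fields. Building on choroba's answer, which allows the last field to be "apple+apple+orange" (whether that is wanted is not clearly defined in the OP), I would write it like this: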

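use 5.14.1;
use warnings;
use Text::CSV_XS qw( csv );

# Read the word list, one word per line.
chomp (my @fruits = do { local @ARGV = "file.txt"; <> });

# Position of each word in file.txt, used to order the matches.
my %order;
@order{@fruits} = 0 .. $#fruits;

# One alternation of all words, longest first.
my $regex = join "|", sort { length $b <=> length $a } @fruits;

csv (
    in          => "data.csv",
    eol         => "\n",
    quote_empty => 1,    # write empty fields as ""
    on_in       => sub {
        # Collect the matches as hash keys so each word appears only once,
        # then append them, joined with "+", as a new last field.
        push @{$_[1]}, join "+" =>
            sort { $order{$a} <=> $order{$b} }
            keys %{{map { $_ => 1 }
                    ($_[1][1] =~ m/\b($regex)\b/g)}};
        },
    );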