Remove duplicates within a line of a CSV file

Time: 2016-08-05 15:05:56

Tags: regex perl sqlite csv awk

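The input file is the output of an SQL query that uses GROUP_CONCAT, and it contains some duplicated values. DISTINCT is already applied, but that is not enough, because only some substrings are identical.

The column I'm interested in is column 9. The idea is to print each IAB category only once per line.

Sample from the file: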

148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23,IAB3;IAB11;IAB17;IAB12;IAB9;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,728x90,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x100,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x200,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x300,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"

I want to remove the repeated IAB categories, so the first line would look like this:

148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23;IAB13"

In my SQL query I have something like this:

SELECT GROUP_CONCAT(DISTINCT foo) FROM t;

The foo column can contain rows with values like these:

foo
bar
qrr
foo;bar
foo;qrr
foo
foo;qrr
bar
qrr
foo

Concatenating these values with DISTINCT removes all the exact duplicates, leaving:

foo
bar
qrr
foo;bar
foo;qrr

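I'm interested in the individual values (foo, bar, qrr). DISTINCT only compares whole strings, so if the separator used for concatenation is also ;, it looks as though not all duplicates were removed.

After joining with ;, the final output in that column should be: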

foo;bar;qrr
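
The per-value deduplication described here can be sketched in Bash with awk. This is only an illustration, assuming the concatenated string uses nothing but `,` and `;` as separators and contains no quoted or escaped values; the function name `dedupe_values` is hypothetical.

```shell
#!/usr/bin/env bash
# dedupe_values (hypothetical helper): read lines of values separated by
# ',' or ';' and print each distinct value once, joined by ';',
# keeping the order of first appearance.
dedupe_values() {
    awk '{
        n = split($0, parts, /[,;]/)      # break the line into individual values
        out = ""
        split("", seen)                   # portable way to clear the array
        for (i = 1; i <= n; i++) {
            # skip empty pieces and anything already printed for this line
            if (parts[i] != "" && !seen[parts[i]]++) {
                out = (out == "" ? "" : out ";") parts[i]
            }
        }
        print out
    }'
}

# The DISTINCT result from the example above, joined with ","
printf '%s\n' 'foo,bar,qrr,foo;bar,foo;qrr' | dedupe_values   # prints foo;bar;qrr
```

The `seen` array is reset for every input line, so values are deduplicated within a line only, never across lines.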

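How can I remove these duplicates?

I tried to do it myself, but I'm not that advanced with AWK and similar tools.

I'm working in Bash, although I could also handle this one step earlier, in SQLite.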

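3 answers:

Answer 0: (score: 1)

As long as the column to be processed is always the only one in double quotes, and it is acceptable to replace all the separators with semicolons, this will do what you ask: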

use strict;
use warnings 'all';

use List::Util 'uniq';

while ( <> ) {
    s{ " ([^"]+) " }{ '"' . join(';', uniq $1 =~ /\w+/g) . '"' }ex;
    print;
}

Output

148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23;IAB13"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,728x90,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x100,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x200,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x300,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
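
One way to sanity-check output like this is to count, per line, how many values in the quoted column occur more than once. The following is a sketch under the same assumption as the answer (exactly one double-quoted column per line); `count_dupes` is a hypothetical name, and the sample rows are shortened for readability.

```shell
#!/usr/bin/env bash
# count_dupes (hypothetical helper): for each input line, print how many
# values in the single double-quoted column repeat an earlier value.
count_dupes() {
    awk -F'"' '{
        n = split($2, vals, /[,;]/)   # $2 is the text between the quotes
        split("", seen)               # clear the array for each line
        dupes = 0
        for (i = 1; i <= n; i++) {
            if (seen[vals[i]]++) dupes++
        }
        print dupes
    }'
}

# A deduplicated row reports 0 repeats
printf '%s\n' '118243,728x90,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"' | count_dupes
# A row that still mixes "," and ";" groups reports 2 (IAB3 and IAB11 repeat)
printf '%s\n' '118243,728x90,standard,3,"IAB3;IAB11,IAB3;IAB11;IAB9"' | count_dupes
```

Piping the processed file through such a check and looking for any non-zero line is a quick way to confirm that no repeats survived.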

Answer 1: (score: -1)

template<int L> 
class FP {
public:
    int n;

    template<int K> 
    FP<L+K> add(FP<K> a) {
        FP<L+K> r;
        r.n = n+a.n;
        return r;
    }

    template<int K> int addS(FP<K> a) {
        return L+K;
    }
};

int main()
{
   FP<1> n1;
   FP<2> n2;
   FP<n1.addS(n2)> n3 = n1.add(n2);
 }

Answer 2: (score: -1)

$ awk '
BEGIN { FS=OFS="\"" }
{
    split($2,iabs,/[,;]/)
    tmp = ""
    delete seen
    for (i=1;i in iabs;i++) {
        if (!seen[iabs[i]]++) {
            tmp = (tmp == "" ? "" : tmp ";") iabs[i]
        }
    }
    $2 = tmp
}
1
' file
148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23;IAB13"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,728x90,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x100,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x200,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x300,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"