Numerous queries on a large CSV file

Time: 2019-07-03 20:18:12

Tags: mongodb csv unix awk

I am trying to query the url column of a large csv file (100 GB, roughly 1.1 billion records) for partial matches. My goal is to query for roughly 23,000 possible matches.

Sample input:

url,answer,rrClass,rrType,tlp,firstSeenTimestamp,lastSeenTimestamp,minimumTTLSec,maximumTTLSec,count
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
maps.google.com.malwaredomain.com.,118.193.165.236,in,a,white,1442148766000,1442148766000,600,600,1
whois.ducmates.blogspot.com.,216.58.194.193,in,a,white,1535969280784,1535969280784,44,44,1

The queries have the following pattern: /^.*[someurl].*$/. Each [someurl] comes from a separate file, which can be assumed to be an array of size 23,000.

Matching queries:

awk -F, '$1 ~ /^.*google\.com\.$/' > file1.out
awk -F, '$1 ~ /^.*nokiantires\.com\.$/' > file2.out
awk -F, '$1 ~ /^.*woodpapersilk\.com\.$/' > file3.out
awk -F, '$1 ~ /^.*xn--.*$/' > file4.out

Non-matching queries:

awk -F, '$1 ~ /^.*seasonvintage\.com\.$/' > file5.out
awk -F, '$1 ~ /^.*java2s\.com\.$/' > file6.out

file1.out:

maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2

file2.out:

nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1

file3.out:

woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2

file4.out:

xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38

Both file5.out and file6.out are empty, since there are no matches. I have also uploaded these inputs and outputs as a gist.

Essentially, each query extracts partial matches from the url column.

Currently, I use the following awk code to search for possible matches:

awk -F, '$1 ~ /^.*xn--.*$/' file.out > filter.csv

This solution returns valid results, but it takes 14 minutes to run a single query. Unfortunately, I am looking for 23,000 possible matches.

I am therefore looking for a more feasible and efficient solution.

I have thought of/tried the following:

  1. Could I combine all of the patterns into one giant regular expression, or would that just add inefficiency? (See the sketch after this list.)
  2. I have tried MongoDB, but it did not work well on just a single machine.
  3. I have an AWS credit with a balance of about $30. Is there a specific AWS solution that could help here?
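
Regarding idea 1, here is a minimal sketch of what the combined-regex approach could look like (the filenames queries.txt, file.csv and combined.out are assumptions, and the patterns are assumed to have their regex metacharacters already escaped, e.g. google\.com\.$). It joins all patterns into one alternation and scans the 100 GB file once, so awk evaluates a single dynamic regex per record instead of making 23,000 separate passes:

awk -F, '
NR==FNR { pat = (pat=="" ? $0 : pat "|" $0); next }  # 1st file: join all patterns with "|"
$1 ~ pat                                             # 2nd file: one regex test per record
' queries.txt file.csv > combined.out

Whether this is faster depends heavily on the awk implementation (GNU awk compiles the regex to a DFA, so a single pass can beat 23,000 passes by a wide margin), and it only tells you that some pattern matched, not which one.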

What would be a more feasible solution for processing these queries against the csv file above?

Many thanks in advance

1 answer:

Answer 0 (score: 1)

Given what we know so far, and guessing at the answers to a couple of questions, I would approach this by separating the queries into those that can be matched by a hash lookup (all of the queries except xn--.*$ in the example you posted) and those that need a regexp comparison to match (just xn--.*$ in your example), and then evaluating them as each record is read. That way, any $1 that can be matched against all of the hashable queries by a near-instantaneous hash lookup is handled that way, and only the few queries that truly need a regexp match are processed sequentially in a loop:

$ cat ../queries
google.com.$
nokiantires.com.$
woodpapersilk.com.$
xn--.*$
seasonvintage.com.$
java2s.com.$

$ cat ../records
url,answer,rrClass,rrType,tlp,firstSeenTimestamp,lastSeenTimestamp,minimumTTLSec,maximumTTLSec,count
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1
woodpapersilk.,138.201.32.142,in,a,white,1546339972354,1553285334535,3886,14399,2
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38
maps.google.com.malwaredomain.com.,118.193.165.236,in,a,white,1442148766000,1442148766000,600,600,1
whois.ducmates.blogspot.com.,216.58.194.193,in,a,white,1535969280784,1535969280784,44,44,1

$ cat ../tst.awk
BEGIN { FS="," }
NR==FNR {
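    # first input file (the queries): build the lookup tables and create an empty output file per query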
    query = $0
    outFile = "file" ++numQueries ".out"
    printf "" > outFile; close(outFile)
    if ( query ~ /^[^.]+[.][^.]+[.][$]$/ ) {
        # simple end of field string, can be hash matched
        queriesHash[query] = outFile
    }
    else {
        # not a simple end of field string, must be regexp matched
        queriesRes[query] = outFile
    }
    next
}
FNR>1 {
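    # second input file (the records): FNR>1 skips the header line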
    matchQuery = ""
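    # fast path: grab the last 2 dot-separated labels of $1 (e.g. "google.com.")
    # and look that key up in the hash of simple queries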
    if ( match($1,/[^.]+[.][^.]+[.]$/) ) {
        fldKey = substr($1,RSTART,RLENGTH) "$"
        if ( fldKey in queriesHash ) {
            matchType  = "hash"
            matchQuery = fldKey
            outFile    = queriesHash[matchQuery]
        }
    }
    if ( matchQuery == "" ) {
        for ( query in queriesRes ) {
            if ( $1 ~ query ) {
                matchType  = "regexp"
                matchQuery = query
                outFile    = queriesRes[matchQuery]
                break
            }
        }
    }
    if ( matchQuery != "" ) {
        print "matched:", matchType, matchQuery, $0, ">>", outFile | "cat>&2"
        print >> outFile; close(outFile)
    }
}

$ ls
$
$ tail -n +1 *
tail: cannot open '*' for reading: No such file or directory

$ awk -f ../tst.awk ../queries ../records
matched: hash google.com.$ maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2 >> file1.out
matched: hash google.com.$ drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2 >> file1.out
matched: hash nokiantires.com.$ nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1 >> file2.out
matched: regexp xn--.*$ xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38 >> file4.out

$ ls
file1.out  file2.out  file3.out  file4.out  file5.out  file6.out
$
$ tail -n +1 *
==> file1.out <==
maps.google.com.,173.194.112.106,in,a,white,1442011301000,1442011334000,300,300,2
drive.google.com.,173.194.112.107,in,a,white,1442011301000,1442011334000,300,300,2

==> file2.out <==
nokiantires.com.,185.53.179.22,in,a,white,1529534626596,1529534626596,600,600,1

==> file3.out <==

==> file4.out <==
xn--c1yn36f.cn.,167.160.174.76,in,a,white,1501685257255,1515592226520,14400,14400,38

==> file5.out <==

==> file6.out <==
$

The initial printf "" > outFile; close(outFile) is just there to make sure every query gets an output file, even if that query matches nothing, as you asked for in your example.

If you are using GNU awk, it can manage multiple open output files for you, and you can then make the following changes:

  1. printf "" > outFile; close(outFile) -> printf "" > outFile
  2. print >> outFile; close(outFile) -> print > outFile

This is more efficient because the output files are then not opened and closed on every print.
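
For reference, a sketch of those two statements after the change (everything else in tst.awk stays the same); note that awk's ">" redirection truncates a file only the first time it is used in a run, so the later print appends:

    printf "" > outFile            # creates/truncates the file; gawk keeps it open
    # ... rest of the script unchanged ...
    print > outFile                # appends, since outFile was already opened above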