How can I solve this two-table data join using AWK?

Date: 2018-07-10 17:41:33

Tags: perl awk filemaker

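I have two data tables, as shown (they are two tab-delimited files). I am trying to populate the Country column of Table 2 with the corresponding country from Table 1. The join has to be derived from the information in the Firstname field of Table 2.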

2 INPUT TABLES & DESIRED RESULT

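Given how messy the data in the Firstname column of Table 2 is, what is the best approach here? Would another tool available on the Mac be a better fit than AWK, such as Excel formulas, Perl, or FileMaker?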

TABLE1 (input):

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa Clara Cuba    CU
Varadero    Cuba    CU

TABLE2 (input):

Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

TABLE2 (result):

Firstname   Country
Fred, Havana  Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

============= Debugging information in response to Ed's questions below:

awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table3.txt | cat -v

    1<city_ascii  country iso2><><>
    1<Mavinga Angola  AO><><>
    1<Menongue    Angola  AO><><>
    1<Mucusso Angola  AO><><>
    1<Guines  Cuba    CU><><>
    1<Havana  Cuba    CU><><>
    1<Holguin Cuba    CU><><>
    1<Las Tunas   Cuba    CU><><>
    1<Manzanillo  Cuba    CU><><>
    1<Matanzas    Cuba    CU><><>
    1<Moron   Cuba    CU><><>
    1<Santa Clara Cuba    CU><><>
    1<Varadero    Cuba    CU><><>

    ==============
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' Table4.txt | cat -v

    1<Firstname><><>
    1<Fred, Havana><><>
    1<James, (Varadero, Cuba)><><>
    1<Jack (Cuba)><><>
    1<Harry Varadero, Cuba><><>
    1<Josh Cuba><><>
    1<Gary, Mavinga & Other, Angola><><>
    1<Jamie, (Angola)><><>

    ===============
    cat -v tst.awk

    BEGIN { FS=OFS="\t" }
    NR==FNR {
        map[$1] = $2
        map[$2] = $2
        next
    }
    FNR==1 {
        print
        FS=" "
        next
    }
    {
        orig = $0
        country = ""
        gsub(/[^[:alpha:]]/," ")
        for (i=NF; i>0; i--) {
            if ($i in map) {
                country = map[$i]
                break
            }
        }
        print orig, country
    }

    ===============
    awk -f tst.awk Table3.txt Table4.txt >output.txt

    Firstname
    Fred, Havana    
    James, (Varadero, Cuba) 
    Jack (Cuba) 
    Harry Varadero, Cuba    
    Josh Cuba   
    Gary, Mavinga & Other, Angola   
    Jamie, (Angola) 

    ================
    awk -F'\t' '{print NF"<"$1"><"$2"><"$3">"}' output.txt | cat -v

    1<Firstname><><>
    2<Fred, Havana><><>
    2<James, (Varadero, Cuba)><><>
    2<Jack (Cuba)><><>
    2<Harry Varadero, Cuba><><>
    2<Josh Cuba><><>
    2<Gary, Mavinga & Other, Angola><><>
    2<Jamie, (Angola)><><>

3 answers:

Answer 0: (score: 6)

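use strict;
use warnings;
use DBI qw();
require DBD::CSV;
use List::Util 1.45 qw(uniq);

chdir '/tmp'; # location of csv files
# Treat the tab-separated files in the current directory as SQL tables.
my $dbh = DBI->connect("dbi:CSV:", undef, undef, {
    f_ext => '.csv',
    csv_sep_char => "\t",
    RaiseError => 1,
}) or die "Cannot connect: $DBI::errstr";

# For each distinct country in table1, set Country on every table2 row
# whose Firstname contains that country name.
# (table2 is assumed to already have a Country column.)
for my $country (
    uniq map { $_->[0] }
    # sql distinct not implemented
    $dbh->selectall_array('select country from table1')
) {
    $dbh->do(
        'update table2 set Country = ? where Firstname like ' .
            $dbh->quote("%$country%"),
        {},
        $country
    );
}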
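
For comparison with the awk answers, the same substring idea (the SQL LIKE '%country%' test) can be sketched in awk with index(). This is only an illustration and not part of the answer above: the file names table1.tsv and table2.tsv and the temporary directory are made up for the example, and the input is assumed to be genuinely tab-separated. Like the SQL version, it matches country names only, so a row such as "Fred, Havana" that names just a city is left with an empty Country.

```shell
# Hypothetical sample files, written to a temp directory for the demo.
dir=$(mktemp -d)
printf 'city_ascii\tcountry\tiso2\nHavana\tCuba\tCU\nMavinga\tAngola\tAO\n' > "$dir/table1.tsv"
printf 'Firstname\nFred, Havana\nJack (Cuba)\nJamie, (Angola)\n' > "$dir/table2.tsv"

result=$(awk '
BEGIN { FS = OFS = "\t" }
NR == FNR {                              # first file: remember each country name once
    if (FNR > 1) countries[$2]
    next
}
FNR == 1 { print $0, "Country"; next }   # second file: extend the header line
{
    found = ""
    for (c in countries)                 # substring test, like SQL LIKE "%country%"
        if (index($0, c)) { found = c; break }
    print $0, found
}' "$dir/table1.tsv" "$dir/table2.tsv")

printf '%s\n' "$result"
rm -rf "$dir"
```

Substring matching is simple but can over-match (a country name embedded in a longer word would also hit), which is one reason the other answers split the line into words first.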

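Answer 1: (score: 1)

If I understand what you are doing, you want to use the first column (city) and the second column (country) of this tab-delimited file: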

city_ascii  country iso2
Mavinga Angola  AO
Menongue    Angola  AO
Mucusso Angola  AO
Guines  Cuba    CU
Havana  Cuba    CU
Holguin Cuba    CU
Las Tunas   Cuba    CU
Manzanillo  Cuba    CU
Matanzas    Cuba    CU
Moron   Cuba    CU
Santa Clara Cuba    CU
Varadero    Cuba    CU

and match the strings from that file against this single-column file:

Firstname
Fred, Havana
James, (Varadero, Cuba)
Jack (Cuba)
Harry Varadero, Cuba
Josh Cuba
Gary, Mavinga & Other, Angola
Jamie, (Angola)

to produce the two-column file shown in your example.

This awk does that:

awk -F '\t' 'FNR==NR{city[$1]=$2; ct[$2]; next}
             # ^^ FNR==NR means it is the first file; set city and country      
     FNR==1 {printf "%s\t%s\n", $0,"Country"; next}
     # ^^   second file, first line - print the header   
     {split($0, arr, /[^[:alpha:]]/)
      # ^ split word like things from paren, punctuation, etc
      for (e in arr) {s=arr[e]   # loop over those words
                      if (s in city) { printf "%s\t%s\n", $0,city[s]; next }
                      # ^ a city? print that
                      if (s in ct) { printf "%s\t%s\n", $0,s; next }}
                      # ^ a country? print that
                      }' file1 file2
Firstname   Country
Fred, Havana    Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba) Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba   Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola

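The next statement tells awk to stop processing the current line, skip any remaining rules, and move on to the next input line.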
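
To make the role of next concrete, here is a minimal sketch of the two-file idiom on its own. The file names a.txt and b.txt and their contents are invented for the illustration. With a non-empty first file, FNR==NR is true only while that first file is being read, and next then skips the second rule, so the second rule effectively runs only on lines of the second file.

```shell
dir=$(mktemp -d)
printf 'Havana\tCuba\nMavinga\tAngola\n' > "$dir/a.txt"   # hypothetical lookup file
printf 'Mavinga\nHavana\nOslo\n' > "$dir/b.txt"             # hypothetical keys to look up

lookup=$(awk -F '\t' '
FNR == NR { city[$1] = $2; next }   # first file: store city -> country, then skip the rule below
{ print $1 ": " (($1 in city) ? city[$1] : "unknown") }   # reached only for the second file
' "$dir/a.txt" "$dir/b.txt")

printf '%s\n' "$lookup"
rm -rf "$dir"
```

Without the next, lines of the first file would also fall through to the second rule and be printed as lookups.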

Answer 2: (score: 1)

It sounds like this might be what you are looking for:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
    map[$1] = $2
    map[$2] = $2
    next
}
FNR==1 {
    print
    FS=" "
    next
}
{
    orig = $0
    country = ""
    gsub(/[^[:alpha:]]/," ")
    for (i=NF; i>0; i--) {
        if ($i in map) {
            country = map[$i]
            break
        }
    }
    print orig, country
}

$ awk -f tst.awk file1 file2
Firstname       Country
Fred, Havana    Cuba
James, (Varadero, Cuba) Cuba
Jack (Cuba)     Cuba
Harry Varadero, Cuba    Cuba
Josh Cuba       Cuba
Gary, Mavinga & Other, Angola   Angola
Jamie, (Angola) Angola
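
The debug output in the question reports NF as 1 for every line of Table3.txt and Table4.txt when the separator is a tab, which suggests those files may not actually contain tab characters. In that case FS="\t" leaves the whole line in $1 and nothing is ever found in map. A quick check along the following lines can confirm this before running tst.awk. The sample file and the expected column count of 3 are assumptions made for the demo.

```shell
dir=$(mktemp -d)
# Hypothetical sample: line 2 uses spaces where tabs were expected.
printf 'city_ascii\tcountry\tiso2\nHavana  Cuba    CU\nMavinga\tAngola\tAO\n' > "$dir/sample.txt"

report=$(awk -F '\t' -v want=3 '
NF != want { printf "line %d: %d field(s), expected %d\n", FNR, NF, want; bad++ }
END        { printf "%d suspicious line(s)\n", bad + 0 }
' "$dir/sample.txt")

printf '%s\n' "$report"
rm -rf "$dir"
```

If every line is flagged, the file is probably space-separated and needs to be re-exported with real tabs (or the script's FS adjusted) before the join can work.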