Question

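I have two data files. One has 1,600 rows and the other has 2 million rows (both are tab-delimited). I need to do a vlookup between the two files. Please see the example below for the expected output and let me know whether this is possible. I tried using awk but could not get the expected result.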

File 1 (small file)
BC1 10 100
BC2 20 200
BC3 30 300

File 2 (big file)
BC1 XYZ
BC2 ABC
BC3 DEF

Expected output:
BC1 10 100 XYZ
BC2 20 200 ABC
BC3 30 300 DEF

I also tried the join command. It takes forever to complete. Please help me find a solution. Thanks.

Answer 1

Commands to produce the output:

awk '{print $1}' *file | sort | uniq -d > out.txt
for i in $(cat out.txt); do grep "$i" large_file >> temp.txt; done
sort -g -k 1 temp.txt > out1.txt
sort -g -k 1 out.txt > out2.txt
paste out1.txt out2.txt | awk '{print $1 $2 $3 $5}'

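Commands for the vlookup

Store the first and second columns in file1 and file2 respectively.

cat file1 file2 | sort | uniq -d   ### records that are present in both files

cat file1 file2 | sort | uniq -u   ### records that are unique, i.e. not present in the bulk file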
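
To make the two uniq commands concrete, here is a minimal sketch. The key lists are made-up sample data, and it assumes each key appears at most once within each file; otherwise a key repeated inside a single file would also be reported by uniq -d.

```shell
# Hypothetical key lists: one key per line, no duplicates inside a file.
workdir=$(mktemp -d)
printf 'BC1\nBC2\nBC3\n' > "$workdir/file1"
printf 'BC2\nBC3\nBC4\n' > "$workdir/file2"

# uniq -d keeps lines that occur more than once in the sorted stream,
# which here means keys present in both files.
both=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -d)

# uniq -u keeps lines that occur exactly once,
# which here means keys present in only one of the files.
only_one=$(cat "$workdir/file1" "$workdir/file2" | sort | uniq -u)

printf 'in both:\n%s\n' "$both"
printf 'in one only:\n%s\n' "$only_one"
rm -rf "$workdir"
```

Note that uniq only compares adjacent lines, which is why the sort step is required.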

Answer 2

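This awk script scans each file line by line and tries to match the number in the BC column. When the numbers match, it prints all the columns. If one of the files does not contain a given number, that number is skipped in all files and the script moves on to the next number available. It loops until one of the files ends. The script accepts any number of files and any number of columns per file, as long as the first column is BC followed by a number. The script assumes the files are sorted by the number in the BC column, from lowest to highest (as in the example). Otherwise it will not work.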

To run the script, use the following command:

awk -f vlookup.awk smallfile bigfile

The vlookup.awk file contains the following:

BEGIN {files=1;lines=0;maxlines=0;filelines[1]=0; 

#Index of the column that holds the BC key

col_bc=1;

#Initialize variables
bc_now=0;

new_bc=0;

end_of_process=0;

aux="";
text_result="";
}
{
if(FILENAME!=ARGV[1])exit;

no_bc=0;
new_bc=0;

#Save number of columns
NFields[1]=NF;

#Copy reference file data
for(j=0;j<=NF;j++) 
{
    file[1,j]=$j;
}

#Read lines from file
for(i=2;i<ARGC;i++)
{
    ret=getline < ARGV[i];
    if(ret==0) exit; #END OF FILE reached
    #Copy columns to file variable
    for(j=0;j<=NF;j++) 
    {
        file[i,j]=$j;
    }
    #Save number of columns
    NFields[i]=NF;

}

#Check that all files are in the same number
for(i=1;i<ARGC;i++) 
{
    #Strip the "BC" prefix in place; sub() returns the substitution count, not the new string
    sub("BC","",file[i,col_bc]);
    bc[i]=file[i,col_bc]+0;
    if(bc[i]>bc_now) {bc_now=bc[i];new_bc=1;}       
}

#One or more files have a new number
if (new_bc==1)
{
    for(i=1;i<ARGC;i++)
    {
        while(bc_now!=file[i,col_bc])
        {
            #Read next line from file
            if(i==1) ret=getline; #File 1 is the reference file
            else ret=getline < ARGV[i];
            if(ret==0) exit; #END OF FILE reached
            #Copy columns to file variable
            for(j=0;j<=NF;j++) 
            {
                file[i,j]=$j;
            }
            #Save number of columns
            NFields[i]=NF;
            #Check if in current file data has gone to next number
            if(file[i,col_bc]>bc_now) 
            {
                no_bc=1;
                break;  
            }
            #No more data lines to compare, end of comparison
            if(FILENAME!=ARGV[1])
            {
                exit;
            }
        }
        #If the number is not in a file, the process to realign must be restarted to the next number available (Exit for loop)
        if (no_bc==1) {break;}
    }
    #If the number is not in a file, the process to realign must be restarted to the next number available (Continue while loop)
    if (no_bc==1) {next;}

} 


#Number is aligned
for(i=1;i<ARGC;i++)
{
    for(j=2;j<=NFields[i];j++) {

        #Join columns in text_result variable
        aux=sprintf("%s %s",text_result,file[i,j]);
        text_result=sprintf("%s",aux);

    }
}
printf("BC%d%s\n",bc_now,text_result)
#Reset text variables
aux="";
text_result=""; 



}
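
For comparison, a much shorter alternative is a hash lookup rather than the line-by-line merge used by the script above. This is only a sketch under assumptions: the file names and sample rows are made up to mirror the question, the key is in column 1 of both files, and the small file fits comfortably in memory (1,600 rows should). Unlike the merge script, it does not need either file to be sorted.

```shell
# Hypothetical sample data mirroring the question (tab-separated).
workdir=$(mktemp -d)
printf 'BC1\t10\t100\nBC2\t20\t200\nBC3\t30\t300\n' > "$workdir/smallfile"
printf 'BC1\tXYZ\nBC2\tABC\nBC3\tDEF\nBC4\tGHI\n' > "$workdir/bigfile"

# First file (NR == FNR): remember each small-file row, keyed by column 1.
# Second file: stream the big file once and print only rows whose key
# was remembered, appending the big file's second column.
result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR   { small[$1] = $0; next }
    $1 in small { print small[$1], $2 }
' "$workdir/smallfile" "$workdir/bigfile")

printf '%s\n' "$result"
rm -rf "$workdir"
```

The output follows the order of the big file, and keys that exist only in the big file (BC4 in this sample) are dropped. Each big-file line costs one array lookup, so the 2-million-row file is read exactly once.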

Answer 3

"I also tried the join command. It takes forever to complete. Please help me find a solution."

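You are unlikely to find a solution (scripted or not) that is faster than the compiled join command. If you cannot wait for join to finish, you need more powerful hardware.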
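
For reference, a minimal sketch of how join is typically run on tab-delimited input. The file names and rows are made-up sample data, and the key is assumed to be column 1 in both files. join expects both inputs to be sorted on the key in the same collation order, so the sketch sorts first and pins LC_ALL=C for sort and join alike; byte-wise ordering also tends to be faster than locale-aware ordering.

```shell
# Hypothetical, deliberately unsorted sample data (tab-separated).
workdir=$(mktemp -d)
printf 'BC2\t20\t200\nBC1\t10\t100\nBC3\t30\t300\n' > "$workdir/smallfile"
printf 'BC3\tDEF\nBC1\tXYZ\nBC2\tABC\n' > "$workdir/bigfile"
tab=$(printf '\t')

# Sort both files on the first field using plain byte ordering.
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/smallfile" > "$workdir/small.sorted"
LC_ALL=C sort -t "$tab" -k1,1 "$workdir/bigfile"   > "$workdir/big.sorted"

# Join on field 1 (the default), keeping tab as the output separator.
joined=$(LC_ALL=C join -t "$tab" "$workdir/small.sorted" "$workdir/big.sorted")

printf '%s\n' "$joined"
rm -rf "$workdir"
```

If the inputs are not sorted consistently, join may silently miss matches, so it is worth checking the sort step before concluding that join itself is the bottleneck.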
