Question: Split all files in a folder based on row structure

What I am trying to accomplish

I have a folder containing several hundred files, each with the same structure. Here is an example:

Start Date  End Date    Code1   Code2   Vendor Identifier   Quantity    V1_1    V1_2    Currency    V1_3    ID  V1_4    V2  V3  V4  TypeID  OtherID Country_of_Sale V5  V6  V7  V8
11/27/16    12/31/16        character_value character_value 2           USD     numeric_value   character_value character_value character_value     character_value     AU              
11/27/16    12/31/16        character_value character_value 1           USD     numeric_value   character_value character_value character_value     character_value     AU              
11/27/16    12/31/16        character_value character_value 1           USD     numeric_value   character_value character_value character_value     character_value     AU                                                                                              
row count   3558                                                                                
Country_of_Sale TotalA  TotalB  TotalC  TotalD  spu TotalE  V2_1    V2_2    TotalF  V2_3    V2_4                                        
AR  0   2782223 2782223 7763.1  0.002790251 22  0.05        0.05    4626.17 5023                                        
US  0   2497603034  2497603034  2958948.67  0.001184715 111374  109.33      109.33  1763291.86  1897441                                     
DO  0   529132  529132  632.54  0.001195429 5   0.01        0.01    376.94  403                                     
EC  0   794440  794440  1669.63 0.002101644 14  0.02        0.02    994.96  1087                                        
BR  0   24397952    24397952    57932.77    0.002374493 217 0.43        0.43    34523.2 37225                                       
Ctotal  109.84                                                                              
Stotal  5680.38                                                                             
Total   5790.22

As you can see, each file should really be two separate files: one with the header row

Start Date  End Date    Code1   Code2   Vendor Identifier   Quantity    V1_1    V1_2    Currency    V1_3    ID  V1_4    V2  V3  V4  TypeID  OtherID Country_of_Sale V5  V6  V7  V8

and one with the header row

Country_of_Sale TotalA  TotalB  TotalC  TotalD  spu TotalE  V2_1    V2_2    TotalF  V2_3    V2_4

The line separating the two always has $1 == "row count" (or should that be /^row count/?).
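For reference, the two tests behave differently depending on the field separator. The sketch below assumes the files are tab-separated (the sample line is made up to mirror the separator row above); with awk's default whitespace splitting, $1 would only be "row", so either an explicit tab separator or the anchored regex is needed.

```shell
# Hypothetical separator line: "row count", a tab, then the count.
line=$(printf 'row count\t3558')

# Default whitespace splitting: $1 is only "row", so this comparison never matches.
default_split=$(printf '%s\n' "$line" | awk '$1 == "row count" { print "match" }')

# Tab as the field separator: $1 is the whole "row count" cell.
tab_split=$(printf '%s\n' "$line" | awk -F'\t' '$1 == "row count" { print "match" }')

# Anchored regex: does not depend on how the fields are split.
regex=$(printf '%s\n' "$line" | awk '/^row count/ { print "match" }')

printf 'default: [%s]  tab: [%s]  regex: [%s]\n' "$default_split" "$tab_split" "$regex"
```

The regex form is the less fragile of the two, since it works whether or not -F is set.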

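I want two resulting files, one for each of the header rows described above. But there are a few hundred more files, all sitting in one directory, from which these need to be extracted.

The problem

I know my solution lies in awk. I don't know awk. I have been researching this for hours and have figured out how to solve different parts of the problem, but I can't work out how to put them all together.

What I ultimately need is two tables that I can join (in SQL) on Country_of_Sale.

Expected results

Simple:

File 1: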

Start Date  End Date    UPC ISRC/ISBN   Vendor Identifier   Quantity    V1_1    V1_2    Currency    V1_3    ID  V1_4    V2  V3  V4  TypeID  OtherID Country_of_Sale V5  V6  V7  V8
    11/27/16    12/31/16        character_value character_value 2           USD     numeric_value   character_value character_value character_value     character_value     AU              
    11/27/16    12/31/16        character_value character_value 1           USD     numeric_value   character_value character_value character_value     character_value     AU              
    11/27/16    12/31/16        character_value character_value 1           USD     numeric_value   character_value character_value character_value     character_value     AU

File 2:

Country_of_Sale TotalA  TotalB  TotalC  TotalD  spu TotalE  V2_1    V2_2    TotalF  V2_3    V2_4                                        
    AR  0   2782223 2782223 7763.1  0.002790251 22  0.05        0.05    4626.17 5023                                        
    US  0   2497603034  2497603034  2958948.67  0.001184715 111374  109.33      109.33  1763291.86  1897441                                     
    DO  0   529132  529132  632.54  0.001195429 5   0.01        0.01    376.94  403                                     
    EC  0   794440  794440  1669.63 0.002101644 14  0.02        0.02    994.96  1087                                        
    BR  0   24397952    24397952    57932.77    0.002374493 217 0.43        0.43    34523.2 37225

What I have tried (as requested :))

I started with this:

gawk '
  /^row count/ {nextfile}
  NR == 1 {$0 = "Filename" OFS $0; print} 
  FNR > 1 {$0 =  FILENAME OFS $0; print}
' OFS='\t' dir/to/raw/files/*.txt > dir/to/munged/file/file1.txt

and

gawk 'FNR==1,/^Country_Of_Sale/{next} /^CTotal/ {nextfile} 
{ $0 =  FILENAME OFS $0; print }' OFS='\t' dir/to/raw/files/*.txt > dir/to/munged/file/file2.tsv

This kind of works, but I would like to do it in a single command.

So I messed around with different permutations of this:

awk -F, '{print > $1}' file1

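But to be honest, I really don't understand it. I am much more comfortable with data that has already been wrangled.

I hope I have provided enough detail here. I certainly don't want to take advantage of this resource.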

Answer 1

Assuming your files have a .txt extension and you want the resulting files to be named with a .txt.1 or .txt.2 extension, you could try the following:

gawk 'BEGINFILE{f=FILENAME".1"} /^row count/{f=FILENAME".2";next} /^Ctotal/{nextfile} {print>f}' *.txt

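Explanation:

At the start of processing each input file, the variable f is set to FILENAME.1, where FILENAME (a built-in awk variable) is the name of the file currently being processed. BEGINFILE is a GNU awk (gawk) extension.
When the current line of the current input file starts with row count, the variable f is set to FILENAME.2 and the line is skipped.
When the current line of the current input file starts with Ctotal, the rest of that file is skipped.
The variable f is used as the output file name for all lines that are not skipped.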
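If gawk is not available, the same logic can be approximated without BEGINFILE or nextfile. The following is only a sketch under stated assumptions: the sample file name sample.txt and its shortened, tab-separated contents are made up for the demo, and the skip flag and close() calls are swapped in as replacements for the gawk-specific features. Closing each output file once it is finished is a precaution, since some awk implementations limit how many files can be open at once.

```shell
#!/bin/sh
# Hypothetical demo: build one small tab-separated sample file, then split it.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Shortened stand-in for one raw file (most columns dropped for readability).
printf 'Start Date\tEnd Date\tCountry_of_Sale\n'  > sample.txt
printf '11/27/16\t12/31/16\tAU\n'                >> sample.txt
printf 'row count\t1\n'                          >> sample.txt
printf 'Country_of_Sale\tTotalA\n'               >> sample.txt
printf 'AR\t0\n'                                 >> sample.txt
printf 'Ctotal\t109.84\n'                        >> sample.txt
printf 'Total\t5790.22\n'                        >> sample.txt

awk '
  FNR == 1 {                        # first line of each input file
      if (out != "") close(out)     # release the previous output file
      out = FILENAME ".1"           # part 1: the detail rows
      skip = 0
  }
  /^row count/ {                    # separator line: switch to part 2 and drop the line
      close(out)
      out = FILENAME ".2"
      next
  }
  /^Ctotal/ { skip = 1 }            # everything from Ctotal onwards is ignored
  !skip     { print > out }         # write every remaining line to the current part
' *.txt

# Show what ended up in each part.
for part in sample.txt.1 sample.txt.2; do
    echo "== $part"
    cat "$part"
done
```

After this runs, sample.txt.1 holds the first header row and its data line, and sample.txt.2 holds the Country_of_Sale header and the AR line; the row count, Ctotal and Total lines appear in neither.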
