How to break a large ".csv" file into small files based on multiple conditions?

Asked: 2015-08-20 09:35:22

Tags: linux excel bash csv command-line

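I have large .csv files (~40MB) that I want to split into smaller files according to several conditions, with the output files named after the data:

  1. Split the file by the content of the 3rd column.
  2. Split the output of point 1 by the content of the 4th column.
  3. Here is the tricky part:

    1. In the output created by the 2 previous operations, check whether there is any data in the 11th column. If there is, split the data by that content, and then split it again by the content of the 17th column -> then save the outputs /OR/AND/
    2. If there is no data in the 11th column, check the 15th column and split accordingly. Next, check the 17th column and split the data by it -> save the outputs.

I have something like this in VBA, but it is too slow for big files, and Excel sometimes crashes. With many files like this, cutting them up by hand before they can be put to use takes a very long time.

Is it possible to cut files under this many conditions?

Thanks in advance for your help.

Example (the header row gives the column numbers):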

      1       2   3   4   11  15  17
      Date        Time    COUNTRY CITY    CHECK   TEST    TEST2
      2015-08-20  11:54   ENGLAND ABINGDON        1       1
      2015-08-21  12:54   ENGLAND BATLEY          2       5
      2015-08-22  13:54   ENGLAND FROME           2       6
      2015-08-23  14:54   ENGLAND FROME   2       1
      2015-08-24  15:54   USA CALIFORNIA          4       8
      2015-08-25  16:54   USA CONNECTICUT         4       9
      2015-08-26  17:54   USA DELAWARE    1               8
      2015-08-27  18:54   GERMANY SAXONY          6       9
      2015-08-28  19:54   GERMANY SAXONY          6       10
      2015-08-27  18:54   GERMANY SAXONY          4       11
      2015-08-28  19:54   GERMANY SAXONY          4       14
      2015-08-29  20:54   GERMANY HESSE                   8
      2015-08-29  20:54   GERMANY HESSE   1               8
      
      File1                       
      2015-08-20  11:54   ENGLAND ABINGDON        1       1
      
      File2                       
      2015-08-21  12:54   ENGLAND BATLEY          2       5
      
      File3                       
      2015-08-22  13:54   ENGLAND FROME           2       6
      
      File4                       
      2015-08-23  14:54   ENGLAND FROME   2               1
      
      File5                       
      2015-08-24  15:54   USA CALIFORNIA          4       8
      
      File6                       
      2015-08-25  16:54   USA CONNECTICUT         4       9
      
      File7                       
      2015-08-26  17:54   USA DELAWARE    1               8
      
      File8                       
      2015-08-27  18:54   GERMANY SAXONY          4       9
      2015-08-28  19:54   GERMANY SAXONY          4       10
      
      File9                       
      2015-08-27  18:54   GERMANY SAXONY          6       11
      2015-08-28  19:54   GERMANY SAXONY          6       14
      
      File10                      
      2015-08-29  20:54   GERMANY HESSE                   8
      
      File11                      
      2015-08-29  20:54   GERMANY HESSE   1               8
      

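3 Answers:

Answer 0 (score: 0)

Your data is all over the place! It is not in the columns you describe, nor is it tab-separated. You are not making life easy!

Try this awk on your real data to see whether it generates output filenames that you can use.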

awk -F'\t' '{
    f=$3 "_" $4                # filename = field3 _ field4
    if(length($11)){           # if f11 not null
      f=f "_A_" $11 "_" $17    #    filename = filename _A_ field11 _ field17
    }else{                     # else
      f=f "_B_" $15 "_" $17    #    filename = filename _B_ field15 _ field17
    }
    print f}' file.csv

You should get something like this:

ENGLAND_ABINGDON_A_3_1
ENGLAND_ABINGDON_A_4_2
GERMANY_SAXONY_B_5_3

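Basically, it runs awk and tells it that your field separator is a tab. It then looks at each line and builds the output filename in the variable f from the fields you describe.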
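If you want to try the filename logic before pointing it at the real data, here is a minimal sketch. It assumes a made-up two-row sample (sample.tsv, with invented values) that really is tab-separated and really has 17 columns, and it runs the same awk logic over it. The row() helper exists only to build the sample.

```shell
#!/bin/bash
# Sketch only: sample.tsv and its values are invented for illustration.
workdir=$(mktemp -d)
sample="$workdir/sample.tsv"

# Write two 17-column, tab-separated rows; only columns 3, 4, 11, 15 and 17 are filled.
awk '
function row(c3, c4, c11, c15, c17,    i, a, line) {
    for (i = 1; i <= 17; i++) a[i] = ""
    a[3] = c3; a[4] = c4; a[11] = c11; a[15] = c15; a[17] = c17
    line = a[1]
    for (i = 2; i <= 17; i++) line = line "\t" a[i]
    print line
}
BEGIN {
    row("ENGLAND", "FROME",  "2", "",  "1")   # column 11 filled: "_A_" branch
    row("GERMANY", "SAXONY", "",  "6", "9")   # column 11 empty:  "_B_" branch
}' > "$sample"

# Same filename logic as above, printing the name instead of writing a file.
names=$(awk -F'\t' '{
    f = $3 "_" $4
    if (length($11)) f = f "_A_" $11 "_" $17
    else             f = f "_B_" $15 "_" $17
    print f
}' "$sample")

echo "$names"
```

With this sample the two names printed are ENGLAND_FROME_A_2_1 and GERMANY_SAXONY_B_6_9. If your real file produces names with empty pieces, the separator or the column positions are probably not what you think they are.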

If your data is separated the way you say it is, you can write each line to the correspondingly named file simply by changing the last line:

awk -F'\t' '{
    f=$3 "_" $4                # filename = field3 _ field4
    if(length($11)){           # if f11 not null
      f=f "_A_" $11 "_" $17    #    filename = filename _A_ field11 _ field17
    }else{                     # else
      f=f "_B_" $15 "_" $17    #    filename = filename _B_ field15 _ field17
    }
    print > f}' file.csv

Basically, it prints each line to the file, rather than printing the file's name, if you change

print f

to

print > f
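One thing to watch with print > f is that the field values become the file name verbatim, so a city containing a space gives an awkward name, and one containing "/" points into a directory that may not exist. A minimal sketch of one way around that, assuming you are happy to replace every character other than letters, digits, "_", "." and "-" with an underscore, to add a .csv suffix, and to collect the pieces in a separate directory (out is a made-up name, as are the sample values):

```shell
#!/bin/bash
# Sketch only: the sample data, the "out" directory and the .csv suffix are assumptions.
workdir=$(mktemp -d)
sample="$workdir/sample.tsv"
outdir="$workdir/out"
mkdir -p "$outdir"

# Two 17-column, tab-separated rows; one city name contains a space.
awk '
function row(c3, c4, c11, c15, c17,    i, a, line) {
    for (i = 1; i <= 17; i++) a[i] = ""
    a[3] = c3; a[4] = c4; a[11] = c11; a[15] = c15; a[17] = c17
    line = a[1]
    for (i = 2; i <= 17; i++) line = line "\t" a[i]
    print line
}
BEGIN {
    row("ENGLAND", "FROME",    "2", "",  "1")
    row("USA",     "NEW YORK", "",  "4", "8")
}' > "$sample"

awk -F'\t' -v outdir="$outdir" '{
    f = $3 "_" $4
    if (length($11)) f = f "_A_" $11 "_" $17
    else             f = f "_B_" $15 "_" $17
    gsub(/[^A-Za-z0-9_.-]/, "_", f)     # replace anything unsafe in a file name
    print > (outdir "/" f ".csv")       # parentheses keep the redirection target unambiguous
}' "$sample"

ls "$outdir"
```

Note that two different values could collapse to the same sanitized name (for example "NEW YORK" and "NEW_YORK"), in which case their rows end up in one file.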

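Adding headers

If you want a header on each output file, we need to work a little harder...

First, we need to save the header from the original file. Assuming it is record number 1, we would do this: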

awk -F'\t' '
    NR==1 {header=$0; next}     # save first line as header, do not treat it as data
    {f=$3 "_" $4                # filename = field3 _ field4
    ...
    ... as before

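Now we need to output the header line whenever we start writing to a new file, which is "fun" because we are creating the output filenames dynamically for each line! So we need to "remember" which files we have already written to, and emit the header only the first time we write to a new one. I don't have a decent set of data here, so I am guessing at this!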

awk -F'\t' '
    NR==1 {header=$0; next}    # save first line as header, do not treat it as data
    {f=$3 "_" $4               # filename = field3 _ field4
    if(length($11)){           # if f11 not null
      f=f "_A_" $11 "_" $17    #    filename = filename _A_ field11 _ field17
    }else{                     # else
      f=f "_B_" $15 "_" $17    #    filename = filename _B_ field15 _ field17
    }
    # Emit header if first write to this filename
    if(!(f in fileswritten)){
       fileswritten[f]++         # note that we have written to this file
       print header > f          # emit header
    }
    print > f}' file.csv
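If the data produces a great many distinct file names, be aware that print > f keeps every output file open until awk exits, and some awk implementations will fail once they run out of file descriptors (others, gawk for instance, work around this internally). A hedged sketch of one common workaround: always append with >> and close the previous file whenever the target changes. It assumes the output directory starts out empty, because >> would otherwise append to leftovers from an earlier run. The sample data and the out directory are invented.

```shell
#!/bin/bash
# Sketch only: sample data, "out" directory and .csv suffix are assumptions.
workdir=$(mktemp -d)
sample="$workdir/sample.tsv"
outdir="$workdir/out"       # must be empty before the run, since we append
mkdir -p "$outdir"

# A header line plus three data rows; rows 1 and 3 belong in the same file.
awk '
function row(c3, c4, c11, c15, c17,    i, a, line) {
    for (i = 1; i <= 17; i++) a[i] = ""
    a[3] = c3; a[4] = c4; a[11] = c11; a[15] = c15; a[17] = c17
    line = a[1]
    for (i = 2; i <= 17; i++) line = line "\t" a[i]
    print line
}
BEGIN {
    row("COUNTRY", "CITY",   "CHECK", "TEST", "TEST2")
    row("ENGLAND", "FROME",  "2", "",  "1")
    row("GERMANY", "SAXONY", "",  "6", "9")
    row("ENGLAND", "FROME",  "2", "",  "1")
}' > "$sample"

awk -F'\t' -v outdir="$outdir" '
    NR == 1 { header = $0; next }       # keep the header, do not route it as data
    {
        f = $3 "_" $4
        if (length($11)) f = f "_A_" $11 "_" $17
        else             f = f "_B_" $15 "_" $17
        f = outdir "/" f ".csv"

        if (f != prev) {                # target changed: release the old file
            if (prev != "") close(prev)
            prev = f
        }
        if (!(f in seen)) {             # first write to this file: emit the header
            seen[f] = 1
            print header >> f
        }
        print >> f                      # reopening with >> appends, never truncates
    }' "$sample"

ls "$outdir"
```

At most one output file is open at any time. If the input is sorted so that rows for the same target are adjacent, files are rarely reopened and the cost of close() is negligible.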

Answer 1 (score: 0)

This answer is incomplete, but it roughly shows what you need to do:

#!/bin/bash

# Get list of countries:
countries=`cat file1.csv | cut -f 3 -d$'\t'| grep -v 3 | grep -v COUNTRY | sort -u`

for country in ${countries}; do
    # Get list of cities per country (sort -u, because uniq alone only drops adjacent duplicates):
    cities=`cat file1.csv | grep ${country} | cut -f 4 -d$'\t' | sort -u`

    # Get data per country:
    cat file1.csv | grep ${country} > file1-${country}.csv

    # Get data per city per country:
    for city in ${cities}; do
        echo ${country}:${city}

        cat file1.csv | grep ${country} | grep ${city} > file1-${country}-${city}.csv

        # Created output by 2 previous operations check if there is any data in 11th column,
        # if yes then separate this data accordingly to content and after that separate that
        # by content of 17th column -> then save outputs /OR / AND /
        # Column 11 is at position 5 in your data.
        checks=`cut -f 5 -d$'\t' file1-${country}-${city}.csv | sort -u`
        for check in ${checks}; do
            echo ${country}:${city}:${check}

            # Keep only the rows of this country/city whose 5th field equals ${check}
            awk -F'\t' -v c="${check}" '$5 == c' file1-${country}-${city}.csv > file1-${country}-${city}-${check}.csv

            # TODO: Further split this, I assume you get the drift by now.
        done
    done

    # If there is no data in column 11 check column 15th and separate accordingly.
    # Next check 17 column and separate this data by 17th column -> save outputs.
    # TODO: Further split this, I assume you get the drift by now.

done
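Before running a loop like this, it can help to see which groups exist and how large they are. A small sketch, assuming a tab-separated file with the country in field 3 and the city in field 4 (the sample rows are made up). Keep in mind that uniq only collapses adjacent duplicate lines, so unsorted input has to go through sort first.

```shell
#!/bin/bash
# Sketch only: the three sample rows are invented and deliberately unsorted.
workdir=$(mktemp -d)
sample="$workdir/sample.tsv"
printf '%s\t%s\t%s\t%s\n' \
    d1 t1 GERMANY SAXONY \
    d2 t2 ENGLAND FROME \
    d3 t3 GERMANY SAXONY > "$sample"

# Number of rows per (country, city) pair.
counts=$(cut -f 3,4 "$sample" | sort | uniq -c)
echo "$counts"

# Distinct countries, regardless of the order of the input.
countries=$(cut -f 3 "$sample" | sort -u)
echo "$countries"
```

Each line of the first listing corresponds to one file the country/city split would create, prefixed by the number of rows it would contain.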

Answer 2 (score: 0)

I suggest writing a small script that uses the Java library CSVFormat (part of Apache Commons CSV):

private static final String[] FILE_HEADER_MAPPING = {"Date", "Time" ,"COUNTRY", .... };
csvFileParser = new CSVParser(fileReader, csvFileFormat);
        List<CSVRecord> csvRecords = csvFileParser.getRecords();

Then, to access column 11, you have to do this:

 for (int i = 1; i < csvRecords.size(); i++) {
    boolean publishAccount = true;
    CSVRecord record = csvRecords.get(i);
    // access a column by its header name ("Field column 11" stands in for the real name)
    record.get("Field column 11");
 }