Create a single CSV from multiple CSVs in a directory: copy both columns from the first CSV, only the second column from subsequent CSVs

Time: 2016-03-31 00:00:35

Tags: php html csv awk sed

I'd like to create a single CSV from many CSVs in a directory. I know this has been covered many times, but my case has a small twist. What I want to do:

  1. Find the largest file.
  2. Use the largest file as the base. The first column of the largest file will be the primary key against which I merge the remaining files.
  3. Compare each file in the directory against the primary key in the first CSV and append each CSV's second column to the largest one.
  4. That said, here is what I'm using so far:

    I found this link on adding a column from one csv to another:

    https://askubuntu.com/questions/553219/add-column-from-one-csv-to-another-csv-file

    I could use something like this to add a column from one file to another:

    paste -d, file2 <(cut -d, -f3- file1)
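
    Note that `paste` combines rows purely by position, so if any file is missing a timestamp, every later row shifts over. A key-based sketch with GNU `join` (the `--header`, `-e ''`, and `-o auto` options are GNU extensions; inputs must be sorted on the key column; the file names and sample data here are illustrative):

```shell
# Build two tiny sample files keyed on the date column.
printf 'date,vpool06\n2016-03-28 12:00:00,0.000\n2016-03-28 12:01:00,0.000\n2016-03-28 12:02:00,0.000\n' > file1.csv
printf 'date,vpool02\n2016-03-28 12:00:00,0.000\n2016-03-28 12:02:00,0.000\n' > file2.csv

# Merge on the first column instead of by position (GNU join):
#   --header  pass the header lines through without pairing them
#   -a 1      keep rows present only in file1
#   -e ''     fill fields missing from file2 with an empty string
#   -o auto   emit all fields from both files
join --header -t, -a 1 -e '' -o auto file1.csv file2.csv
```

    Unmatched rows from `file1.csv` keep their place and the missing `vpool02` field stays empty instead of shifting the column.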
    

    The following PHP gets the file list of a directory; I'm now trying to use PHP to combine/merge the CSVs.

    $dir = $Folder . '/Stats/Latency/';   // source directory
    $ar  = scandir($dir);
    $box = $_POST['box'];                 // file list received from the form
    
    // Loop through the list of selected files
    foreach ($box as $val) {              // each() is deprecated; foreach replaces it
        $path = $dir . "/" . $val;
        $dest = $Folder . "/Report/Latency/" . $val;
        if (copy($path, $dest)) {         // the stray ";" after if(...) made it a no-op
            echo "$val,";
        }
    }
    echo "<hr>";
    

    This is where I need the CSV merge below. I've been debating using a shell_exec command, but that seems very labor-intensive.

    $reportFiles = $Folder . "/Report/Latency/";
    foreach (glob($reportFiles . "*.csv") as $file)
    {
        // The original concatenation ("touch "$reportFiles.") was a syntax
        // error; PHP's built-in touch() also avoids shelling out.
        touch($reportFiles . "latencyReport.csv");
    }
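
    Step 1 above (find the largest file) never made it into the code. A shell sketch that picks the CSV with the most rows as the base file (assumes GNU/Linux `wc` output and file names without newlines or leading spaces):

```shell
# Create sample files of different sizes for illustration.
printf 'h\n1\n' > a.csv
printf 'h\n1\n2\n3\n4\n' > b.csv
printf 'h\n1\n2\n' > c.csv

# List each CSV with its line count, sort descending, keep the top name.
base=$(for f in *.csv; do
           printf '%s %s\n' "$(wc -l < "$f")" "$f"
       done | sort -rn | head -1 | cut -d' ' -f2-)
echo "$base"
```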
    

    As for the data in the csv files:

    CSV1:

    date,vpool06
    2016-03-28 12:00:00,0.000
    2016-03-28 12:01:00,0.000
    2016-03-28 12:02:00,0.000
    2016-03-28 12:03:00,0.000
    2016-03-28 12:04:00,0.000
    2016-03-28 12:05:00,0.000
    2016-03-28 12:06:00,0.000
    2016-03-28 12:07:00,0.000
    2016-03-28 12:08:00,0.000
    2016-03-28 12:09:00,0.000
    2016-03-28 12:10:00,0.000
    2016-03-28 12:11:00,0.000
    2016-03-28 12:12:00,0.000
    2016-03-28 12:13:00,0.000
    2016-03-28 12:14:00,0.000
    2016-03-28 12:15:00,0.000
    2016-03-28 12:16:00,0.000
    2016-03-28 12:17:00,0.000
    2016-03-28 12:18:00,0.000
    2016-03-28 12:19:00,0.000
    

    CSV2:

    date,vpool02
    2016-03-28 12:00:00,0.000
    2016-03-28 12:01:00,0.000
    2016-03-28 12:02:00,0.000
    2016-03-28 12:04:00,0.000
    2016-03-28 12:05:00,0.000
    2016-03-28 12:06:00,0.000
    2016-03-28 12:07:00,0.000
    2016-03-28 12:08:00,0.000
    2016-03-28 12:09:00,0.000
    2016-03-28 12:10:00,0.000
    2016-03-28 12:11:00,0.000
    2016-03-28 12:12:00,0.000
    2016-03-28 12:13:00,0.000
    2016-03-28 12:14:00,0.000
    

    CSV3:

    date,vpool03
    2016-03-28 12:00:00,0.000
    2016-03-28 12:01:00,0.000
    2016-03-28 12:02:00,0.000
    2016-03-28 12:04:00,0.000
    2016-03-28 12:05:00,0.000
    

    Merged CSV:

    date,vpool06,vpool02,vpool03
    2016-03-28 12:00:00,0.000,0.000,0.000
    2016-03-28 12:01:00,0.000,0.000,0.000
    2016-03-28 12:02:00,0.000,0.000,0.000
    2016-03-28 12:03:00,0.000,,0.000
    2016-03-28 12:04:00,0.000,0.000,0.000
    2016-03-28 12:05:00,0.000,0.000,0.000
    2016-03-28 12:06:00,0.000,0.000,
    2016-03-28 12:07:00,0.000,0.000,
    2016-03-28 12:08:00,0.000,0.000,
    2016-03-28 12:09:00,0.000,0.000,
    2016-03-28 12:10:00,0.000,0.000,
    2016-03-28 12:11:00,0.000,0.000,
    2016-03-28 12:12:00,0.000,0.000,
    2016-03-28 12:13:00,0.000,0.000,
    2016-03-28 12:14:00,0.000,0.000,
    2016-03-28 12:15:00,0.000,,
    2016-03-28 12:16:00,0.000,,
    2016-03-28 12:17:00,0.000,,
    2016-03-28 12:18:00,0.000,,
    2016-03-28 12:19:00,0.000,,
    

    Ideally, I don't mind whether a "null" value exists at this point, since it won't show up in the chart; it just means the server was down at that time.

    I need "null" in the blanks where there is no data. UPDATE: example.

    date,vpool06,7NA_01,7NA_02,bd01,bd02,vpool01,vpool02,vpool03,vpool04,vpool07
    2016-03-28 12:00:00,1.000,null,10.00,02.00,20.00,0.00,0.00,0.00,0.00,0.000
    2016-03-28 12:01:00,0.000,11.00,110.00,null,11.00,0.00,0.00,0.00,0.00,0.000
    2016-03-28 12:02:00,0.000,null,0.00,2.00,100,0.00,0.00,0.00,0.00,0.000
    2016-03-28 12:03:00,0.000,0.00,0.00,02.00,10.00,0.00,0.000,0.00,0.00,0.000
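
    To get a literal "null" into the blanks after merging, empty fields can be rewritten in a separate pass; a small awk sketch (the `merged.csv` sample below is illustrative):

```shell
# A merged row with one empty field (server down at 12:03).
printf 'date,vpool06,vpool02,vpool03\n2016-03-28 12:03:00,0.000,,0.000\n' > merged.csv

# Replace every empty CSV field with the literal string "null".
awk -F, -v OFS=, '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "null"; print }' merged.csv
```

    This only works because the fields contain no embedded commas; quoted CSV fields would need a real CSV parser.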
    

2 Answers:

Answer 0 (score: 1)

awk to the rescue!

$ awk -F, -v OFS=, 'FNR==1{c++} {a[$1,c]=$2;keys[$1]}
                       END{for(k in keys) 
                            {printf "%s", k; 
                             for(i=1;i<=c;i++) 
                                 printf "%s", OFS (((k,i) in a)?a[k,i]:""); 
                             print ""}}' file{1,2,3} | 
 sort -t, -k1,1 | 
 tee >(sed '$d' > merged) >(tail -1 >> merged) 

$ cat merged

date,vpool06,vpool02,vpool03
2016-03-28 12:00:00,0.000,0.000,0.000
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,,
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,0.000,
2016-03-28 12:07:00,0.000,0.000,
2016-03-28 12:08:00,0.000,0.000,
2016-03-28 12:09:00,0.000,0.000,
2016-03-28 12:10:00,0.000,0.000,
2016-03-28 12:11:00,0.000,0.000,
2016-03-28 12:12:00,0.000,0.000,
2016-03-28 12:13:00,0.000,0.000,
2016-03-28 12:14:00,0.000,0.000,
2016-03-28 12:15:00,0.000,,
2016-03-28 12:16:00,0.000,,
2016-03-28 12:17:00,0.000,,
2016-03-28 12:18:00,0.000,,
2016-03-28 12:19:00,0.000,,
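
The `tee >(...)` line above starts two concurrent writers on `merged`, so the final file layout can depend on timing. A sequential sketch of the same header fix (assuming GNU coreutils for `head -n -1`, with `all.csv` standing in for the awk output):

```shell
# Unsorted merge output: the header is mixed in with the data.
printf 'date,vpool06\n2016-03-28 12:01:00,0.000\n2016-03-28 12:00:00,0.000\n' > all.csv

# Sort, then move the header line back to the top. It sorts last
# because "date" > "2016-..." in ASCII order.
sort -t, -k1,1 all.csv > body.tmp
tail -n 1 body.tmp  > merged    # the header line
head -n -1 body.tmp >> merged   # everything except the header (GNU head)
rm body.tmp
cat merged
```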

Answer 1 (score: 1)

I don't know how you'd do it in PHP, but using GNU awk for true 2D arrays and "sorted_in" it would be:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { hdr[ARGIND][1]=$1; hdr[ARGIND][2]=$2; next }
{ arr[ARGIND][$1] = $2 }
END {
    for (idx in arr) {
        numRows = length(arr[idx])
        if (numRows > maxRows) {
            maxRows = numRows
            maxIdx  = idx
        }
    }

    printf "%s%s%s", hdr[maxIdx][1], OFS, hdr[maxIdx][2]
    for (idx=1; idx<=ARGIND; idx++) {
        if (idx != maxIdx) {
            printf "%s%s", OFS, hdr[idx][2]
        }
    }
    print ""

    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (tstamp in arr[maxIdx]) {
        printf "%s%s%s", tstamp, OFS, arr[maxIdx][tstamp]
        for (idx=1; idx<=ARGIND; idx++) {
            if (idx != maxIdx) {
                printf "%s%s", OFS, (tstamp in arr[idx] ? arr[idx][tstamp] : "null")
            }
        }
        print ""
    }
}

$ awk -f tst.awk csv3 csv2 csv1
date,vpool06,vpool03,vpool02
2016-03-28 12:00:00,0.000,0.000,0.000
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,null,null
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,null,0.000
2016-03-28 12:07:00,0.000,null,0.000
2016-03-28 12:08:00,0.000,null,0.000
2016-03-28 12:09:00,0.000,null,0.000
2016-03-28 12:10:00,0.000,null,0.000
2016-03-28 12:11:00,0.000,null,0.000
2016-03-28 12:12:00,0.000,null,0.000
2016-03-28 12:13:00,0.000,null,0.000
2016-03-28 12:14:00,0.000,null,0.000
2016-03-28 12:15:00,0.000,null,null
2016-03-28 12:16:00,0.000,null,null
2016-03-28 12:17:00,0.000,null,null
2016-03-28 12:18:00,0.000,null,null
2016-03-28 12:19:00,0.000,null,null