Merge two CSV files if the Id column matches

Time: 2017-10-25 22:39:23

Tags: bash csv command-line text-processing

I have the following:

file1.csv

"Id","clientName1","clientName2"

file2.csv

"Id","Name1","Name2"

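I want to read file1 sequentially. For each record, I want to check whether a matching Id exists in file2. There may be multiple matches. For each match, I want to append Name1, Name2 to the end of the record from file1.csv.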

So if a record has multiple matches in file2, the result could look like this:

"Id","clientName1","clientName2","Name1","Name2","Name1","Name2"

4 answers:

Answer 0: (score: 0)

I'm afraid bash may not be an efficient solution, but the following bash script works:

#!/bin/bash

# Maps each Id to the comma-joined fields collected from both files (bash 4+).
declare -A id_hash

for file in file1.csv file2.csv; do
    while IFS= read -r line; do
        id=${line%%,*}     # first field
        name=${line#*,}    # remaining fields
        if [ -z "${id_hash[$id]}" ]; then
            id_hash[$id]=$name
        else
            id_hash[$id]=${id_hash[$id]},$name
        fi
    done < "$file"
done

# Associative arrays are unordered, so the output order is not guaranteed.
for id in "${!id_hash[@]}"; do
    echo "$id,${id_hash[$id]}"
done

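Answer 1: (score: 0)

In reply to the OP's clarification in his/her comment, here is a revised version of the awk command. It merges the two files even if duplicate IDs exist in file1 or file2 and even if records have different numbers of fields. (The old version worked for the OP's question as currently stated.)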

awk -F',' '{one=$1;$1="";a[one]=a[one]$0} END{for (i in a) print i""a[i]}' OFS=, file[12]

Input:

file1

"Id1","clientN1","clientN2"
"Id2","Name3","Name4"
"Id3","client00","client01","client02"
"Id1","client1","client2","client3"
     

file2

"Id1","Name1","Name2"
"Id1","Name3","Name4"
"Id2","Name0","Name1"
"Id2","Name00","Name11","Name22"

Output, with file1 and file2 merged on the same ID:

"Id1","clientN1","clientN2","client1","client2","client3","Name1","Name2","Name3","Name4"
"Id2","Name3","Name4","Name0","Name1","Name00","Name11","Name22"
"Id3","client00","client01","client02"

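Note that awk's for (i in a) loop visits keys in an unspecified order, so the merged lines may not come out in input order. Below is a minimal sketch of one way to keep Ids in first-seen order. It assumes plain comma-separated records with no commas inside quoted fields, and merge_keep_order is a made-up helper name, not something from the answer.

```shell
# Sketch only: merge records that share an Id, keeping Ids in first-seen order.
# merge_keep_order is a hypothetical helper name.
# Assumes simple CSV: no embedded commas or newlines inside quoted fields.
merge_keep_order() {
    awk -F',' '
        NF == 0 { next }                            # skip empty lines
        {
            id = $1
            if (!(id in merged)) order[++n] = id    # remember first-seen order
            # Keep everything after the Id, including the leading comma.
            merged[id] = merged[id] substr($0, length(id) + 1)
        }
        END {
            for (k = 1; k <= n; k++) print order[k] merged[order[k]]
        }
    ' "$@"
}
```

It would be called as merge_keep_order file1 file2, with the files listed in the order their fields should be appended.
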
Answer 2: (score: 0)

A solution using join and a GNU sed regular expression:
join -t , -a 1 file[12].csv | sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'

This assumes file1.csv and file2.csv are both sorted by id and have no header row.

file1.csv

1,c11,c12
2,c21,c22
3,c31,c32

file2.csv

1,n11,n12
1,n21,n22
1,n31,n32
2,n41,n42

which gives the result:
1,c11,c12,n11,n12,n21,n22,n31,n32
2,c21,c22,n41,n42
3,c31,c32
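
join expects both inputs to be sorted on the join field. If the files are not already sorted, one possible way to prepare them is sketched below. join_unsorted is a made-up helper name, and the -s (stable) flag of sort is a GNU/BSD extension rather than POSIX; it is used here so that records sharing an id keep their original relative order.

```shell
# Sketch only: sort both files on the first field, then left-join them.
# join_unsorted is a hypothetical helper name.
# LC_ALL=C keeps the collation used by sort and join consistent.
join_unsorted() {
    ju_sorted2=$(mktemp) || return 1
    # -s (stable sort) is a GNU/BSD extension; it preserves the input order
    # of records that share the same id.
    LC_ALL=C sort -s -t, -k1,1 "$2" > "$ju_sorted2"
    # "-" makes join read the sorted copy of the first file from stdin;
    # -a 1 also prints records from the first file that have no match.
    LC_ALL=C sort -s -t, -k1,1 "$1" | LC_ALL=C join -t, -a 1 - "$ju_sorted2"
    ju_status=$?
    rm -f "$ju_sorted2"
    return "$ju_status"
}
```

The output has one line per match, so it can be piped into the sed command above to fold the matches for each id into a single line.
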

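Update

If file1.csv may contain duplicate IDs and records with varying numbers of fields, I suggest a preprocessing step to make sure file1.csv is clean before joining it with file2.csv:

awk -F, '{for(i=2;i<=NF;i++) print $1 FS $i}' file1.csv |\
sort -u |\
sed -r '$!N;/^(.*,)(.*)\n\1/!P;s//\n\1\2,/;D'

  • The first awk process splits all data into (id,name) pairs
  • sort -u sorts the pairs and removes duplicates
  • The last sed process merges all pairs with the same ID into one line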

Input

1,c11,c12
1,c12,c14,c13
1,c15,c12
2,c21,c22

Output

1,c11,c12,c13,c14,c15
2,c21,c22
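
The preprocessing pipeline above depends on GNU sed (-r). As a rough alternative, the same cleanup can be sketched in awk alone. This is only an illustration under the same assumptions (plain comma-separated fields), and dedupe_fields is a made-up name. Unlike sort -u, it keeps ids and values in first-seen order rather than sorted order.

```shell
# Sketch only: collapse records that share an id and drop repeated values.
# dedupe_fields is a hypothetical helper name.
# Values keep their first-seen order (sort -u would sort them instead).
dedupe_fields() {
    awk -F',' '
        NF == 0 { next }                            # skip empty lines
        {
            id = $1
            if (!(id in line)) { order[++n] = id; line[id] = id }
            for (i = 2; i <= NF; i++) {
                if (!((id, $i) in seen)) {          # first time this value appears for this id
                    seen[id, $i] = 1
                    line[id] = line[id] "," $i
                }
            }
        }
        END { for (k = 1; k <= n; k++) print line[order[k]] }
    ' "$@"
}
```

For the four sample records 1,c11,c12 / 1,c12,c14,c13 / 1,c15,c12 / 2,c21,c22, dedupe_fields file1.csv prints 1,c11,c12,c14,c13,c15 followed by 2,c21,c22.
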

Answer 3: (score: 0)

Thanks everyone, but I've got it done. The code I wrote is as follows:

#!/bin/bash

echo
echo 'Merging files into one'

IFS=","
# Read each file1 record; the column names match my own data layout.
while read -r id lname fname dnaid status type program startdt enddt ref email dob age add1 add2 city postal phone1 phone2

do
var="$dnaid,$lname,$fname,$status,$type,$program,$startdt,$enddt,$ref,$email,$dob,$age,$add1,$add2,$city,$postal,$phone1,$phone2"

  # Rescan file2 for every file1 record and append each matching name pair.
  while read -r id2 cwlname cwfname
  do
       if [ "$id" = "$id2" ]
       then
           var="$var,$cwlname,$cwfname"
       fi

  done < file2.csv

  echo "$var" >> /root/scijoinedfile.csv

done < file1.csv

echo
echo "Merging completed"
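
The nested loops above reread file2.csv once for every record in file1.csv. For larger files, a single pass over each file may be faster. The following is a hedged sketch of that idea in awk, not the code I actually ran: append_matches is a made-up name, it assumes plain comma-separated records, and it appends the file2 fields to the unchanged file1 record instead of reordering columns as the script above does.

```shell
# Sketch only: load file2 into memory once, then stream file1 in order.
# append_matches is a hypothetical helper name.
# Usage: append_matches file1.csv file2.csv
append_matches() {
    awk -F',' '
        pass == 1 {
            # file2: collect the fields after the Id, keyed by Id
            # (the leading comma is kept so they can be appended directly).
            extra[$1] = extra[$1] substr($0, length($1) + 1)
            next
        }
        # file1: print the record followed by any matches from file2.
        { print $0 extra[$1] }
    ' pass=1 "$2" pass=2 "$1"
}
```

Redirecting the output, for example append_matches file1.csv file2.csv > joined.csv, writes the merged records in file1 order; joined.csv is just a placeholder name.
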