Comparing CSV files with bash / awk / shell

Asked: 2014-11-27 09:23:41

Tags: bash shell awk

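I wrote a script that compares CSV files, but I still have a problem. Every output line must always have the form

5 values - blank column - 5 values

The problem is that some lines contain only 4 values, so I need to put a space in place of the missing column.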

Input:

File1:

1,1,1,1
3,3,3,3,3

File2:

2,2,2,2
4,4,4,4,4

The current result is:

1,1,1,1, ,2,2,2,2
3,3,3,3,3, ,4,4,4,4,4

The result I need is:

1,1,1,1, , , 2,2,2,2,*space* 
3,3,3,3,3, ,4,4,4,4,4

Here is my code:

#! /bin/bash

#------------------------------------------------------------------------------
#
# Description: Joins the files vertically based on the file extensions.
#
# Usage      : ./joinfile directory1 directory2
#
#------------------------------------------------------------------------------

#---- Variables ---------------------------------------------------------------

resultfile="resultfile.csv"

#---- Main --------------------------------------------------------------------

# Checking if two arguments are provided, if not, display usage info, and exit.
if [ "$#" -ne 2 ]
then
   echo "Usage: $0 directory1 directory2"
   exit 1
fi

# Checking if any of the arguments provided is not a directory.
if [ ! -d "$1" -o ! -d "$2" ]
then
   if [ ! -d "$1" ]
   then
      echo "Error: $1 is not a valid directory"
   fi

   if [ ! -d "$2" ]
   then
      echo "Error: $2 is not a valid directory"
   fi

   exit 1
fi

# Removing the end slash from the arguments, if user had provided.
dir1=$(echo "$1" | sed 's/\/$//')
dir2=$(echo "$2" | sed 's/\/$//')

# Creating an array of files having ^ in their filenames.
filearr=( $(ls "$dir1"/*^* "$dir2"/*^*) )

# Getting filearr length.
filearrlen=${#filearr[@]}

# Creating an array of extensions.
for (( i=0; i<"$filearrlen"; i++ ))
do
   extarr+=(${filearr[i]##*^})
done

# Removing duplicates and the last extension from an extarr.
OLDIFS="$IFS"
IFS=$'\n'
newextarr=($(for i in "${extarr[@]}"; do echo "$i" | sed 's/\.[^.]*$//'; done | sort -du))
IFS="$OLDIFS"

# Getting newextarr length.
newextarrlen=${#newextarr[@]}

# Removing the previous outfile, if exists.
if [ -e "$resultfile" ]
then
   rm "$resultfile"
fi

# Joining the files vertically based on the extensions.
for (( i=0; i<"$newextarrlen"; i++ ))
do
   ext="${newextarr[i]}"
    echo "Handling ==> $ext"
   # Getting files with similar extensions.
   joinfiles=($(for j in "${filearr[@]}"; do echo "$j" | grep "\^$ext"; done))

   # Getting joinfiles array length.
   joinfileslen=${#joinfiles[@]}

   # Making a list of files to be pasted.
   for (( k=0; k<"$joinfileslen"; k++))
   do
      pastefiles+="${joinfiles[k]} "
        dos2unix "${joinfiles[k]}" 2>/dev/null
        cat "${joinfiles[k]}" | grep "^[ \t]*([0-9]* [0-9]*)," | sed 's/^[ \t]*//g'  | sort -t, -k1 | cut -d',' -f1- >.ext_${k}_tags.csv
   done

   # Executing paste command.
   echo "==> ${ext}" >> "$resultfile"

awk 'BEGIN{ FS = "," }
{
if(FNR == NR){ a[$1] = $0 } else{ b[$1] = $0 }

for(i in a) { 
if (i in b) 
{ c[i]=a[i]", ,"b[i]; if (a[i] == b[i] ) { c[i]="True,"c[i]; } else { c[i]="False,"c[i]; } 
} else { c[i]="False,"a[i]", ,"i",MISSING-MISSING-MISSING";}
}
for(i in b) { 
if (!(i in a)) { c[i]="False,"i",MISSING-MISSING-MISSING, ,"b[i]; }
}
}
END{
for (i in c){ print c[i]; }
}
' ".ext_0_tags.csv" ".ext_1_tags.csv"|sort -t, -k1 >> "$resultfile"

rm -f ".ext_0_tags.csv" ".ext_1_tags.csv"

done

#---- End ---------------------------------------------------------------------

2 answers:

Answer 0: (score: 1)

Here is one way to solve the problem:

awk -F, '{a[FNR]=a[FNR] sprintf("%s,%s,%s,%s,%s%s",$1,$2,$3,$4,($5==""?" ":$5),(NR==FNR?", ,":""))}
END{for(i=1;i<=FNR;++i)print a[i]}' file1.txt file2.txt

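This uses an array to join the two files line by line. The %s specifiers in sprintf take the values of the columns, or a space if the fifth column is empty. While the first file is being processed (NR==FNR), the final %s becomes the ", ," separator; for the second file it is empty. Once all records have been processed, the elements of the array are printed.

A few assumptions are made here: only the fifth column can be empty, and both files contain the same number of records.

Output: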

1,1,1,1, , ,2,2,2,2,
3,3,3,3,3, ,4,4,4,4,4
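
If the two files might not have the same number of records, the same idea can be extended. The following is only a sketch under stated assumptions: the function name join_uneven and the helper five are made up for illustration, the first file is assumed to be non-empty (otherwise NR==FNR would also match the second file), and a record missing on one side is assumed to be wanted as five space-only columns.

```shell
#!/bin/bash
# join_uneven FILE1 FILE2  (hypothetical helper, not part of the original answer)
# Joins the files line by line as "5 values, blank column, 5 values",
# even when one file has more lines than the other.
join_uneven() {
  awk -F, '
    # Return the current record as exactly 5 comma-separated values.
    # Empty or missing fields become a single space.
    function five(    i, s) {
      s = ""
      for (i = 1; i <= 5; i++)
        s = s (i > 1 ? "," : "") ($i == "" ? " " : $i)
      return s
    }
    NR == FNR { left[FNR] = five(); nl = FNR; next }   # first file
              { right[FNR] = five(); nr = FNR }        # second file
    END {
      n = (nl > nr ? nl : nr)      # the longer file decides the line count
      blank = " , , , , "          # five space-only columns
      for (i = 1; i <= n; i++) {
        line = ((i in left) ? left[i] : blank) ", ," ((i in right) ? right[i] : blank)
        print line
      }
    }
  ' "$1" "$2"
}
```

It would be called as join_uneven file1.txt file2.txt. Because every field is checked in a loop, this sketch also drops the assumption that only the fifth column can be empty.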

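Answer 1: (score: 1)

Another awk solution.

Set both the field separator and the output field separator to a comma.
If a line has fewer than 5 fields, set field 5 to a space. Store each line in an array. When reading the second file, print the saved line from the first file, a space, and the current line.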

awk -F, -v OFS=, 'NF<5{$5=" "}{a[NR]=$0}FNR!=NR{print a[FNR]," ",$0}' file file2

1,1,1,1, , ,2,2,2,2,
3,3,3,3,3, ,4,4,4,4,4

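I am assuming that lines have only 4 or 5 fields; if a line has fewer than 4 fields, this will not fill all of the empty fields with spaces. It also assumes that there are only two files.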
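
If lines with fewer than 4 fields can occur, a loop over the first five fields could fill every empty or missing one with a space. This is a minimal sketch rather than a tested drop-in replacement: the function name pad_and_join is hypothetical, and it still assumes exactly two files with the same number of lines and a non-empty first file.

```shell
#!/bin/bash
# pad_and_join FILE1 FILE2  (hypothetical helper name)
# Pads every line to 5 fields, then prints: line of FILE1, blank column, line of FILE2.
pad_and_join() {
  awk -F, -v OFS=, '
    # Fill each empty or missing field among the first five with a single space.
    # Assigning to a field rebuilds the record using OFS (a comma here).
    { for (i = 1; i <= 5; i++) if ($i == "") $i = " " }
    NR == FNR { a[FNR] = $0; next }   # first file: remember the padded line
    { print a[FNR], " ", $0 }         # second file: left side, blank column, right side
  ' "$1" "$2"
}
```

It would be called as pad_and_join file file2. For lines that already have 4 or 5 fields it behaves like the one-liner above; the only difference is that every short or empty field is padded, not just the fifth.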