How do I convert a tab-delimited file containing commas to .CSV, enclosing the values that contain commas in double quotes?

Asked: 2013-10-02 16:01:29

Tags: linux csv sed awk tab-delimited

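I have a .CSV file (say, tab_delimited_file.csv) that I download from a particular vendor's portal. When I moved the file to one of my Linux directories, I noticed that this particular .CSV file is actually a tab-delimited file that is merely named .CSV. A few sample records from the file are shown below.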

"""column1"""   """column2"""   """column3"""   """column4"""   """column5"""   """column6"""   """column7"""  
12  455 string with quotes, and with a comma in between 4432    6787    890 88  
4432    6787    another, string with quotes, and with two comma in between  890 88  12  455  
11  22  simple string   77  777 333 22

The sample records above are separated by tabs. I know the file's header is very odd, but this is the format in which I receive the file.

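I tried using the tr command to replace the tabs with commas, but because of the extra commas inside the record values, the file ends up completely garbled. I need the record values that contain commas to be enclosed in double quotes. The command I used is as follows.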

tr '\t' ',' < tab_delimited_file.csv > comma_separated_file.csv    

This converts the file to the following format.

"""column1""","""column2""","""column3""","""column4""","""column5""","""column6""","""column7"""
12,455,string with quotes, and with a comma in between,4432,6787,890,88
4432,6787,another, string with quotes, and with two comma in between,890,88,12,455
11,22,simple string,77,777,333,22
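The damage is easy to quantify by counting fields before and after the conversion. The following is a small sketch, assuming a record shaped like the first data row above; the variable names are illustrative only.

```shell
# Build one tab-separated record like the first data row (7 fields).
row=$(printf '12\t455\tstring with quotes, and with a comma in between\t4432\t6787\t890\t88')

# Count fields before conversion, splitting on tabs.
tabs_count=$(printf '%s\n' "$row" | awk -F'\t' '{ print NF }')

# Count fields after the tr conversion, splitting on commas.
commas_count=$(printf '%s\n' "$row" | tr '\t' ',' | awk -F',' '{ print NF }')

echo "tab-separated fields:   $tabs_count"
echo "comma-separated fields: $commas_count"
```

The comma inside the third value becomes indistinguishable from a separator, so any CSV reader would see one field too many on that row.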

I need help converting the sample file to the following format.

column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22

Any solution using sed or awk would be very helpful.

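2 answers:

Answer 0: (score: 2)

This will produce the output you asked for, but it is not clear whether the criterion I have assumed for which fields to put in quotes (any field that contains a comma or whitespace) is actually what you want, so test it yourself with other input to see: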

$ awk 'BEGIN { FS=OFS="\t" }
  {
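     gsub(/"/,"")                     # strip every double quote from the record
     for (i=1;i<=NF;i++)
         if ($i ~ /[,[:space:]]/)     # field contains a comma or whitespace
             $i = "\"" $i "\""
     gsub(OFS,",")                    # replace the tab separators with commas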
     print
  }
  ' file
column1,column2,column3,column4,column5,column6,column7
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22
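To try the quoting criterion against other input, it helps to generate test rows with real tab characters, since a sample copied from a web page usually has its tabs turned into spaces. The sketch below wraps the awk script from this answer in a shell function; the names `tsv_to_csv` and `sample` and the test rows themselves are made up for illustration.

```shell
# Hypothetical helper wrapping the awk script above; reads TSV on stdin.
tsv_to_csv() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    gsub(/"/, "")                      # drop every double quote in the record
    for (i = 1; i <= NF; i++)
      if ($i ~ /[,[:space:]]/)         # field has a comma or whitespace
        $i = "\"" $i "\""
    gsub(OFS, ",")                     # turn the tab separators into commas
    print
  }'
}

# Hypothetical test rows; printf expands \t into real tab characters.
sample() {
  printf '"""column1"""\t"""column2"""\t"""column3"""\n'
  printf '12\tstring, with a comma\t88\n'
  printf '11\tsimple string\t22\n'
  printf '7\tcomma,only\t9\n'
}

result=$(sample | tsv_to_csv)
printf '%s\n' "$result"
```

With these rows, every field holding a comma or a space comes out quoted and the header loses its quotes. Narrowing the regular expression to `/,/` would be one way to quote only the fields that contain a comma, if that is the criterion actually wanted.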

Answer 1: (score: 1)

One way using awk:

awk '
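    BEGIN { FS = "\t"; OFS = "," }      # read tab-separated, write comma-separated
    FNR == 1 {
        # Header: strip the runs of double quotes from every field.
        for ( i = 1; i <= NF; i++ ) { gsub( /"+/, "", $i ) }
        $1 = $1     # rebuild $0 with OFS even if no field was changed
        print $0
        next
    }
    FNR > 1 {
        for ( i = 1; i <= NF; i++ ) {
            # Quote the field when it holds more than one whitespace-separated word.
            w = split( $i, _, " " )
            if ( w > 1 ) { $i = "\"" $i "\"" }
        }
        $1 = $1     # rebuild $0 with OFS even if no field was quoted
        print $0
    }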
' infile

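It splits the input fields on tabs and writes the output with commas. The header is simple: just remove all the double quotes. For data lines, each field is split on whitespace and enclosed in double quotes only when the split returns more than one piece.

It yields: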

column1,column2,column3,column4,column5,column6,column7  
12,455,"string with quotes, and with a comma in between",4432,6787,890,88  
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455  
11,22,"simple string",77,777,333,22
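The quoting test in this answer depends on the value that split() returns, which is the number of whitespace-separated words in the field. Here is a quick sketch of what it returns for a few values; the helper name `word_count` is hypothetical and not part of the answer.

```shell
# Hypothetical helper: print how many whitespace-separated words awk sees.
word_count() {
  printf '%s\n' "$1" | awk '{ print split($0, parts, " ") }'
}

word_count 'simple string'        # 2 words, so the field gets quoted
word_count '4432'                 # 1 word, so it is left bare
word_count 'a,b'                  # 1 word: a comma alone does not trigger quoting
```

The last case is worth keeping in mind: under this criterion a value that contains a comma but no space would be written unquoted, so the approach fits input like the sample, where every value with a comma also contains spaces.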