awk to get a .csv column that contains commas and newlines

Time: 2016-08-02 23:09:32

Tags: linux bash awk

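I have data in a .csv column that sometimes contains commas and newlines. When the data contains a comma, I enclose the entire string in double quotes. How can I parse that column's output into a .txt file, taking the newlines and commas into account?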

Sample data that does not work with my command:

,"This is some text with a , in it.", #data with commas are enclosed in double quotes

,line 1 of data
line 2 of data, #data with a couple of newlines

,"Data that may a have , in it and
also be on a newline as well.",

Here is what I have so far:

awk -F "\"*,\"*" '{print $4}' file.csv > column_output.txt

1 answer:

Answer 0: (score: 1)

$ cat decsv.awk
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; OFS="," }
{
    # create strings that cannot exist in the input to map escaped quotes to
    gsub(/a/,"aA")
    gsub(/\\"/,"aB")
    gsub(/""/,"aC")

    # prepend previous incomplete record segment if any
    $0 = prev $0
    numq = gsub(/"/,"&")
    if ( numq % 2 ) {
        # this is inside double quotes so incomplete record
        prev = $0 RT
        next
    }
    prev = ""

    for (i=1;i<=NF;i++) {
        # map the replacement strings back to their original values
        gsub(/aC/,"\"\"",$i)
        gsub(/aB/,"\\\"",$i)
        gsub(/aA/,"a",$i)
    }

    printf "Record %d:\n", ++recNr
    for (i=0;i<=NF;i++) {
        printf "\t$%d=<%s>\n", i, $i
    }
    print "#######"
}

$ awk -f decsv.awk file
Record 1:
        $0=<,"This is some text with a , in it.", #data with commas are enclosed in double quotes>
        $1=<>
        $2=<"This is some text with a , in it.">
        $3=< #data with commas are enclosed in double quotes>
#######
Record 2:
        $0=<,"line 1 of data
line 2 of data", #data with a couple of newlines>
        $1=<>
        $2=<"line 1 of data
line 2 of data">
        $3=< #data with a couple of newlines>
#######
Record 3:
        $0=<,"Data that may a have , in it and
also be on a newline as well.",>
        $1=<>
        $2=<"Data that may a have , in it and
also be on a newline as well.">
        $3=<>
#######
Record 4:
        $0=<,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",>
        $1=<>
        $2=<"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.">
        $3=<>
#######

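The above uses GNU awk for FPAT and RT. I don't know of any CSV format that allows a newline in the middle of a field that is not enclosed in quotes (if one did, you could never tell where a record ends), so the script does not allow for that. The above was run on this input file: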

$ cat file
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data that may a have , in it and
also be on a newline as well.",
,"Data that \"may\" a have ""quote"" in it and
also be on a newline as well.",
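
The script above prints every field for inspection, while the question asks for a single column written to a .txt file. The following is a minimal sketch of one way to do that. It is not the answer's method: it swaps FPAT and RT for a hand-written field splitter so that it runs under any POSIX awk, and it assumes quotes inside a field are escaped only by doubling them ("") rather than with a backslash. The names sample.csv, count_quotes, field, pending and col are hypothetical, chosen for illustration.

```shell
# Hypothetical sample input, modelled on the file shown above.
cat > sample.csv <<'EOF'
,"This is some text with a , in it.", #data with commas are enclosed in double quotes
,"line 1 of data
line 2 of data", #data with a couple of newlines
,"Data with a ""quote"" in it",
EOF

awk -v col=2 '
# Count the double quotes in s; gsub on the local copy returns the count.
function count_quotes(s) { return gsub(/"/, "", s) }

# Return field number n of the complete record rec, or "" if it has fewer fields.
function field(rec, n,    i, pos, f) {
    i = 0
    while (1) {
        i++
        if (substr(rec, 1, 1) == "\"") {
            # Quoted field: runs to the closing quote, "" is an escaped quote.
            match(rec, /^"([^"]|"")*"/)
            f = substr(rec, 2, RLENGTH - 2)
            gsub(/""/, "\"", f)
            rec = substr(rec, RLENGTH + 1)
            pos = index(rec, ",")
        } else {
            # Unquoted field: runs to the next comma or the end of the record.
            pos = index(rec, ",")
            f = (pos ? substr(rec, 1, pos - 1) : rec)
        }
        if (i == n) return f
        if (!pos) return ""
        rec = substr(rec, pos + 1)
    }
}

{
    # Prepend any incomplete record carried over from earlier lines.
    rec = pending $0
    if (count_quotes(rec) % 2) {
        # Odd number of quotes: still inside a quoted field, keep reading.
        pending = rec "\n"
        next
    }
    pending = ""
    print field(rec, col)
}
' sample.csv > column_output.txt
```

With this sample, column_output.txt should contain four lines: the first field text, the two lines of the multi-line field, and Data with a "quote" in it, all with the enclosing quotes removed. Change col to pick a different column. A record whose closing quote never appears is silently dropped by this sketch.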