Print CSV file columns enclosed in double quotes using awk or sed

Date: 2016-02-15 03:58:09

Tags: bash csv awk sed

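I am working with a CSV file like the one below. It is comma-separated and every cell is enclosed in double quotes, but some cells contain double quotes and/or commas inside the enclosing quotes. The actual file has about 300 columns and 200,000 rows.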

"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, "cde" here","cde","cde","cde"

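I need to remove some unneeded columns and merge the last few columns, separating them with </br> instead of ",". I also need to move the second column to the end. The content of each cell should stay the same, with double quotes and commas exactly as in the original file. Below is an example of the output I need.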

"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, "cde" here","cde</br>cde</br>cde","cde"

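In this example, I want to remove Column3 and merge columns 5, 6, and 7.

Below is the code I tried, but it treats the double quotes and/or commas inside cells as field boundaries, so the rows are split differently from what I expected.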

awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv

sed -i 's@"</br>"@</br>@g' inputfile.csv

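The sed command removes the closing and opening double quotes between the merged cells.

In the output I currently get, a double quote inside a field is treated as the start of a new cell, so the values that follow are usually shifted over by one column.

The other code I tried treats every comma as the start of a new cell, so it does not work either.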

awk -F',' 'BEGIN{OFS=",";} {print $1,$4,$5"</br>"$6"</br>"$7,$2}' inputfile.csv

sed -i 's@"</br>"@</br>@g' inputfile.csv

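Any help is much appreciated. Thanks!

2 Answers:

Answer 0: (score: 2)

CSV is a loosely defined format, and there can be subtle differences in formatting. Your particular format may or may not be expressible with a regular grammar or regular expression. (See this question for a discussion of the problem.) Even if your particular format can be expressed with a regular expression, it may be easier to use a parser from an existing library.

It is not the bash/awk/sed solution you may want or need, but Python has a csv module for parsing CSV files. It has a number of options for adjusting the format. Try something like this: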

#!/usr/bin/env python3

import csv

# newline='' lets the csv module handle line endings itself (Python 3)
with open('infile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in inreader:
        # Merge fields 5,6,7 (indexes 4,5,6) into one
        row[4] = "</br>".join(row[4:7])
        del row[5:7]

        # Copy second field to the end
        row.append(row[1])

        # Remove second and third fields
        del row[1:3]

        # Write manipulated row
        outwriter.writerow(row)

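Note that in Python, indexes start at 0 (e.g. row[1] is the second field). The first index of a slice is inclusive and the last is exclusive (row[1:3] covers row[1] and row[2]). Your format seems to require quotes around every field, hence quoting=csv.QUOTE_ALL. There are more options in the "Dialects and Formatting Parameters" section of the csv module documentation.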
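The indexing, slicing, and quoting rules described here can be checked with a standalone illustration that is separate from the script. The row contents (f1 to f7) and the variable names are made up for the example; only the csv calls come from the standard library.

```python
import csv
import io

# A hypothetical seven-field row, used only to illustrate indexing and slicing.
row = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]

second_field = row[1]            # indexes start at 0, so row[1] is the second field
middle = row[1:3]                # start is inclusive, end is exclusive: row[1] and row[2]
merged = "</br>".join(row[4:7])  # fields 5, 6, and 7 joined into one string

# QUOTE_ALL quotes every field and doubles any embedded quote character.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(["abc", '18" inch TV', "a,b"])
quoted_line = buffer.getvalue()

print(second_field)
print(middle)
print(merged)
print(quoted_line, end="")
```

The writer ends each row with "\r\n" by default, which is why the files in the script are opened with newline=''.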

For your sample input, the script produces the following output:

"Column1","Column4","Column5</br>Column6</br>Column7","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, cde"" here""","cde</br>cde</br>cde","cde"

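This has two problems:

  • It does not treat the first row differently, so the headers of columns 5, 6, and 7 are merged just like the other rows.

  • Your input CSV contains "some other, "cde" here" (third row, fourth column), with unescaped quotes around cde. There is another such case in the second row, but it was removed because it is in column 3. The result contains incorrect quotes.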
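The second problem can be reproduced in isolation. The snippet below is a small sketch that feeds the third sample line, and a properly escaped variant of it, to csv.reader with its default (non-strict) dialect; the variable names are illustrative only.

```python
import csv

# Third line of the sample input, with unescaped quotes around cde.
unescaped_line = '"cde","cde","cde","some other, "cde" here","cde","cde","cde"'
# The same line with the inner quotes doubled, which is what the csv module expects.
escaped_line = '"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"'

unescaped_fields = next(csv.reader([unescaped_line]))
escaped_fields = next(csv.reader([escaped_line]))

# The unescaped quote ends the quoted section early, so the opening quote
# before cde is dropped and the later quotes are kept as literal characters.
print(unescaped_fields[3])
# With doubled quotes the cell content is recovered as intended.
print(escaped_fields[3])
```

Both lines still parse into seven fields; only the content of the fourth field differs.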

If these quotes were escaped properly, your sample input CSV file would become

infile.csv (with escaped quotes)

"Column1","Column2","Column3","Column4","Column5","Column6","Column7"
"abc","abc","this, but with ""comma"" and a quote","18"" inch TV","abc","abc","abc"
"cde","cde","cde","some other, ""cde"" here","cde","cde","cde"

Now consider this modified Python script, which does not merge the columns of the first row:

#!/usr/bin/env python3

import csv

# newline='' lets the csv module handle line endings itself (Python 3)
with open('infile.csv', newline='') as infile, open('outfile.csv', 'w', newline='') as outfile:
    inreader = csv.reader(infile)
    outwriter = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    first_row = True
    for row in inreader:
        if first_row:
            first_row = False
        else:
            # Merge fields 5,6,7 (indexes 4,5,6) into one
            row[4] = "</br>".join(row[4:7])
        del row[5:7]

        # Copy second field (index 1) to the end
        row.append(row[1])

        # Remove second and third fields
        del row[1:3]

        # Write manipulated row
        outwriter.writerow(row)

Output: outfile.csv

"Column1","Column4","Column5","Column2"
"abc","18"" inch TV","abc</br>abc</br>abc","abc"
"cde","some other, ""cde"" here","cde</br>cde</br>cde","cde"

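This is your sample output, but with "some other, ""cde"" here" properly escaped.

This may not be what you were looking for, since it is not a sed or awk solution, but I hope it is still useful. Processing more complicated formats may justify more complicated tools. Using an existing library also removes some opportunities for mistakes.

Answer 1: (score: 0)

This may be an oversimplification of the problem, but it worked on my test data: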

cat /tmp/inputfile.csv | sed 's@\"\,\"@|@g' | sed 's@"</br>"@</br>@g' | awk 'BEGIN {FS="|"} {print $1 "," $4 "," $5 "</br>" $6 "</br>" $7 "," $2}'

Note that I am on a Mac, which may be why I had to wrap the commas in the awk script in quotes.
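The idea behind this pipeline, treating the three-character sequence "," as the field separator rather than a bare comma, can also be sketched in Python. This is an analogue of the approach, not a translation of the pipeline. It assumes that no cell contains the exact sequence "," and that no cell spans several lines; the function names are hypothetical. Cell contents are passed through untouched, so unescaped inner quotes stay as they are in the input.

```python
SEPARATOR = '","'

def split_quoted_line(line):
    """Drop the outermost quotes and split on the "," separator."""
    return line.rstrip("\r\n")[1:-1].split(SEPARATOR)

def rearrange(fields, merge=True):
    """Keep fields 1 and 4, merge fields 5-7 with </br>, move field 2 to the end."""
    tail = "</br>".join(fields[4:7]) if merge else fields[4]
    return [fields[0], fields[3], tail, fields[1]]

def join_quoted_line(fields):
    """Rebuild a line with every field wrapped in double quotes."""
    return '"' + SEPARATOR.join(fields) + '"'

sample = [
    '"Column1","Column2","Column3","Column4","Column5","Column6","Column7"',
    '"abc","abc","this, but with "comma" and a quote","18"" inch TV","abc","abc","abc"',
    '"cde","cde","cde","some other, "cde" here","cde","cde","cde"',
]

# The header (index 0) keeps Column5 alone instead of merging three headers.
result = [
    join_quoted_line(rearrange(split_quoted_line(line), merge=(index > 0)))
    for index, line in enumerate(sample)
]

for line in result:
    print(line)
```

For the real file, the same functions could be applied line by line while reading the input, which keeps memory use low for 200,000 rows.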