Using awk to parse delimited data with variable text and embedded delimiters

Date: 2017-09-26 21:37:44

Tags: bash csv awk sed

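I am trying to work with a messy 4GB txt file that mixes CSV conventions. The data has about 38 columns plus a 39th, and each column is enclosed in '"' characters (example below). The data was exported with a comma as the field separator, but commas also appear inline within the data, which makes it hard to import into most platforms. I believe I can repair the data with awk / sed / cat, since the quotes can be used to define each column, but I cannot figure out how.

I want my final file to have columns identified by what lies between each pair of quotes, with all embedded commas replaced by a period or something similar. The portion that contains commas is in the middle of my columns; it is not the last field in the data set. I tried cutting that portion out with awk, replacing the commas with sed, and then pasting it back into the file with cat.

The actual data is sensitive and cannot be shared; the example below follows the same pattern.

Sample data: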

"identifier","Status","Name","City","Application","Job","Details","column 39"
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1"
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi, beer, or ham","spare2"
"verde","pending","Duncan","Columbus","65478","CEO","Late for work, on the fifth","spare3"

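Desired result. The key point is that the embedded commas are changed, and that the data is added back either inline or at the end, after columns 39 and 34.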

"identifier","Status","Name","City","Application","Job","Details","column 39"
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1"
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi. beer. or ham","spare2"
"verde","pending","Duncan","Columbus","65478","CEO","Late for work. on the fifth","spare3"

Any suggestions are greatly appreciated!

1 answer:

Answer 0: (score: 0)

You can use sed to remove the inner commas, for example:

$ f1=$'"column 1","Column 2","Name","Address","Application","Job","Comments, about, items that also have, commas, inline","column 39"'

$ echo "$f1" |sed -r 's/([^"]),([^"])/\1\2/g'
"column 1","Column 2","Name","Address","Application","Job","Comments about items that also have commas inline","column 39"
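One caveat with this expression: each match consumes the character after the comma, so when two commas are separated by a single character (as in `a,b,c`) a single pass with `g` leaves the second comma in place. A minimal sketch of a workaround, assuming a sed that accepts `-E` (GNU sed treats it the same as `-r`), is to repeat one substitution in a loop until nothing matches. The function name `strip_inner_commas` is made up for illustration.

```shell
# Hypothetical helper: remove every comma that sits between two non-quote characters.
# ':a' defines a label; 'ta' jumps back to it whenever the substitution succeeded,
# so a comma skipped on one pass is picked up on the next.
strip_inner_commas() {
  sed -E -e ':a' -e 's/([^"]),([^"])/\1\2/' -e 'ta'
}

printf '%s\n' '"id","a,b,c","last"' | strip_inner_commas
```

Commas that directly touch a quote, such as a field that begins or ends with a comma, are still left alone by this pattern.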

Or you can replace the inner commas with a placeholder that can be turned back into commas later:

$ f2=$(echo "$f1" |sed -r 's/([^"]),([^"])/\1-x2c-\2/g');echo "$f2"
"column 1","Column 2","Name","Address","Application","Job","Comments-x2c- about-x2c- items that also have-x2c- commas-x2c- inline","column 39"
#or use sed -r 's/([^"]),([^"])/\1.\2/g' to replace inner commas with dots

$ echo "$f2" |sed 's/-x2c-/,/g'
"column 1","Column 2","Name","Address","Application","Job","Comments, about, items that also have, commas, inline","column 39"
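Applied to the sample rows from the question, the dot-replacement variant might look like the sketch below. The helper name `commas_to_dots` and the `sample` variable are illustrative only, and `-E` is used in place of `-r` on the assumption that the installed sed accepts it.

```shell
# Sample rows copied from the question (header plus three records).
sample='"identifier","Status","Name","City","Application","Job","Details","column 39"
"red","paid","Dave","Philadelphia","55823","Cashier","No commas in this comment","spare1"
"rojo","past due","Steve","San Francisco","78434","trainer","Does not like sushi, beer, or ham","spare2"
"verde","pending","Duncan","Columbus","65478","CEO","Late for work, on the fifth","spare3"'

# Hypothetical helper: turn commas that sit between two non-quote characters into periods.
commas_to_dots() {
  sed -E 's/([^"]),([^"])/\1.\2/g'
}

printf '%s\n' "$sample" | commas_to_dots
```

Because sed processes its input one line at a time, the same filter could be pointed at the real file, for example `commas_to_dots < input.txt > fixed.txt`, where both file names are placeholders.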

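Or you can use GNU awk, whose FPAT variable describes what a field looks like (here, a quoted string that may contain commas) instead of splitting on every comma: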

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' '{print $1}'
"column 1"

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' '{print $7}'
"Comments, about, items that also have, commas, inline"

$ echo "$f1" |awk -vFPAT='[^,]*|"[^"]*"' -vOFS="," '{print $1,$7}'
"column 1","Comments, about, items that also have, commas, inline"
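The commands above only print selected fields. To rewrite every row, one option that does not depend on FPAT (a swapped-in technique, not the one shown above) is to use the double quote itself as the field separator, which should work in any POSIX awk. This is a sketch under the assumption that every field is quoted and that no field contains an escaped quote; the function name `dots_in_quoted_fields` is hypothetical.

```shell
# Hypothetical helper: replace commas inside quoted fields with periods.
# With the double quote as field separator, even-numbered awk fields hold the
# text between a pair of quotes; odd-numbered ones hold the separating commas.
dots_in_quoted_fields() {
  awk -F'"' -v OFS='"' '{
    for (i = 2; i <= NF; i += 2)
      gsub(/,/, ".", $i)
    print
  }'
}

printf '%s\n' '"verde","pending","Late for work, on the fifth","spare3"' | dots_in_quoted_fields
```

Unlike the sed pattern, this also catches a comma that sits right next to a quote, since it looks at the whole content of each quoted field.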