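I'm using the stream editor sed to convert a large set of text file data (400MB) to CSV format.
I'm very close to finishing, but the outstanding problem is quotes within quotes, in data like this: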
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"
The desired output is:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
I've been looking for help, but I haven't gotten close to a solution. I tried the following sed commands with these regex patterns:
sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt
These came from other questions, but they don't seem to work with sed.
The original files are *.txt, and I'm trying to edit them with sed.
Answer 0 (score: 2)
Here is one way, using GNU awk and the FPAT variable:
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file
Result:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
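Explanation:
With FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, for each line of input, loop through every field; if the field starts and ends with a double quote, remove all quotes from that field. Finally, add the surrounding double quotes back onto the field.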
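To see how FPAT splits a record before any quotes are removed, a small sketch like the following may help. It assumes GNU awk 4.0 or later (FPAT is a gawk extension), and `show_fields` is a hypothetical helper name used only for illustration.

```shell
# show_fields: hypothetical helper that prints how gawk splits one line under FPAT.
# Assumes GNU awk 4.0+, since FPAT is a gawk extension.
show_fields() {
  printf '%s\n' "$1" |
    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
          { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

# The quoted field keeps its embedded comma and is reported as field 3.
show_fields '2,word2,"a, b",x'
```

Because the quoted alternative can match a longer string than the "no commas" alternative, a comma inside a well-formed quoted field does not split it.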
Answer 1 (score: 1)
sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE
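It looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". Whenever a substitution is made, it loops back and tries again, so that strings nested more than one level deep are also flattened.
It also requires that STRx not contain a comma.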