Question

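I am using the stream editor sed to convert a large set of text files (400MB) to CSV format.

I am very close to finishing, but the outstanding problem is quotes within quotes, for data like this: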

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"

The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

I have been looking for help, but I have not gotten close to a solution. I tried the following regex patterns with sed:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These come from the following question, but they do not seem to work with sed:

Related question for SISS

The original files are *.txt, and I am trying to edit them with sed.

Answer 1

Here is one way using GNU awk and the FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file

Results:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

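Explanation:

With FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, for each line of input, loop over every field; if a field starts and ends with a double quote, remove all quotes from that field. Finally, put the surrounding double quotes back on the field.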
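To see how this FPAT splits a line, here is a minimal sketch. It assumes GNU awk 4.0 or later is installed as gawk (FPAT is a gawk extension), and the helper name show_fields is made up for illustration:

```shell
# Hypothetical helper: print each field that gawk finds using the FPAT above.
# Assumes GNU awk (gawk) 4.0+; other awk implementations do not support FPAT.
show_fields() {
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

# The quoted field keeps its embedded comma and stays a single field.
printf '%s\n' '2,word2,"text, with comma","more"' | show_fields
```

This should list four fields, with "text, with comma" kept whole as the third one, which is why the loop in the answer can then safely strip and re-add quotes per field.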

Answer 2

sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE

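This looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". Whenever a substitution succeeds, it loops and tries again, so that more deeply nested quoted strings are removed as well.

It also requires that none of the STRx parts contains a comma.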
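As a rough illustration of the loop, the sketch below wraps the same substitution in a shell function and runs it on the third sample row. The function name strip_inner_quotes is hypothetical, and the script is split across several -e options only so that the label is terminated the same way on GNU and BSD sed:

```shell
# Hypothetical helper wrapping the loop from the answer; reads stdin, writes stdout.
# :r defines a label, s/// merges "STR1 "STR2" STR3" into one quoted string,
# and "t r" jumps back to the label whenever the substitution succeeded.
strip_inner_quotes() {
  sed -e ':r' \
      -e 's/"\([^",]*\)"\([^",]*\)"\([^",]*\)"/"\1\2\3"/' \
      -e 'tr'
}

printf '%s\n' '3,word3,"description for "word3"","another text","more text and more"' \
  | strip_inner_quotes
```

For whole files it could be used as strip_inner_quotes < input.txt > output.csv. Note that the pattern needs four quotes with no commas in between, so a field with an odd number of stray quotes, such as the last field of the first sample row, appears to keep one extra quote; the output is worth spot-checking.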

sed - remove quotes within quotes in large csv files
